# Modular

> Deploy fast and scalable GenAI inference

This file contains all documentation content in a single document following the llms.txt standard (llmstxt.org).

## @__copy_capture

You can add the `__copy_capture` decorator on a parametric closure to capture register-passable values by copy. This decorator causes a nested function to copy the value of the indicated variable into the closure object at the point of formation, instead of capturing that variable by reference. This allows the closure to be passed as an escaping function, without lifetime concerns.

```mojo
fn foo(x: Int):
    var z = x

    @__copy_capture(z)
    @parameter
    fn formatter() -> Int:
        return z

    z = 2
    print(formatter())

fn main():
    foo(5)
```

---

## @always_inline

You can add the `@always_inline` decorator on any function to make the Mojo compiler "inline" the body of the function (copy it) directly into the body of the calling function. This eliminates potential performance costs associated with function calls jumping to a new point in code. Normally, the compiler will do this automatically where it can improve performance, but this decorator forces it to do so. The downside is that it can increase the binary size by duplicating the function at every call site. For example:

```mojo
@always_inline
fn add(a: Int, b: Int) -> Int:
    return a + b

print(add(1, 2))
```

Because `add()` is decorated with `@always_inline`, Mojo compiles this program without adding the `add()` function to the call stack, and it instead performs the addition directly at the `print()` call site, as if it were written like this:

```mojo
print(1 + 2)
```

## `@always_inline("nodebug")`

You can also use the decorator with the `"nodebug"` argument, which has the same effect of inlining the function, but without debug information. This means that you can't step into the function when debugging.

This decorator is intended to be used on the lowest-level functions in a library, which may wrap primitive functions, MLIR operations, or inline assembly. Marking these functions as "nodebug" prevents users from accidentally stepping into low-level non-Mojo code when debugging.

---

## @compiler.register

The `@compiler.register` decorator registers a custom operation for use with the Graph API. For more information on custom operations, see [Intro to custom ops](/max/custom-ops).

To define a custom operation:

* Import the `compiler` package.
* Create a struct that implements the `execute()` and (optional) `shape()` static methods.
* Register it using the `@compiler.register` decorator.

The following snippet shows the outline of a custom operation:

```mojo
@compiler.register("add_vectors_custom")
struct AddVectorsCustom:

    @staticmethod
    fn execute[...](...) raises:
        pass

    @staticmethod
    fn shape(...) raises -> IndexList:
        pass
```

The `@compiler.register` decorator takes a single argument, the name of the custom operation, as a string. This name is used to load the custom op into your graph.

Output from the `execute()` method is usually returned using one or more destination-passing style (DPS) output tensors. Destination-passing style (DPS) means that the calling function passes in pre-allocated storage space for the output value(s). This allows for more efficient memory management. For example, the graph compiler can optimize memory use by allocating output tensors on the stack, instead of requiring custom ops to allocate heap storage for return values.

Destination passing style requires the graph compiler to determine the dimensions of the output tensor(s) before executing the operation.
It uses the operation's `shape()` function to determine the dimensions if they can't be determined statically. The following sections describe the `execute()` and `shape()` functions. ### `execute()` function The `execute()` function performs the actual work of the custom op. It takes the following parameter: * `target` (`StaticString`): Indicates the device the operation is running on: currently takes the values `"cpu"` or `"gpu"`. Graph output and input tensors are passed to the `execute()` function as instances of [`OutputTensor`](/max/api/mojo/tensor/managed_tensor_slice/#aliases) and [`InputTensor`](/max/api/mojo/tensor/managed_tensor_slice/#aliases), respectively. These are both aliases for specific configurations of [`ManagedTensorSlice`](/max/api/mojo/tensor/managed_tensor_slice/ManagedTensorSlice), so they both have the same API. In addition to input and output tensors, the function can take the following arguments: * Any arguments of type [`Scalar`](/mojo/manual/types#scalar-values). * A single argument of type `DeviceContextPtr`. This opaque pointer is currently required for GPU support. ```mojo import compiler from utils.index import IndexList from max.tensor import OutputTensor, InputTensor, foreach, ManagedTensorSlice from runtime.asyncrt import DeviceContextPtr @compiler.register("add_vectors_custom") struct AddVectorsCustom: @staticmethod fn execute[ # "gpu" or "cpu" target: StaticString, ]( # the first argument is the output out: OutputTensor, # starting here is the list of inputs x: InputTensor[type = out.type, rank = out.rank], y: InputTensor[type = out.type, rank = out.rank], # the context is needed for some GPU calls ctx: DeviceContextPtr, ) raises: @parameter @always_inline fn func[width: Int](idx: IndexList[x.rank]) -> SIMD[x.type, width]: return x.load[width](idx) + y.load[width](idx) foreach[func, target=target](out, ctx) ``` ### `shape()` function The `shape()` function returns the dimensions of the output tensor(s). The `shape()` function is required only if the graph compiler can't statically determine the shape of the output tensor(s), and you don't manually annotate the output shapes when building a graph. The function takes the same arguments as the `execute()` function, minus the output tensors and `DeviceContextPtr`. It must return an [`IndexList`](/mojo/stdlib/utils/index_/IndexList/) specifying the dimensions of the output tensor. For example, if the operation takes two input tensors, and the shape of the output tensor matches the first input tensor, you could use the following `shape()` function: ```mojo @staticmethod fn shape( in1: InputTensor, in2: InputTensor, ) raises -> IndexList[in1.rank]: return in1.spec.shape ``` --- ## @implicit You can add the `@implicit` decorator on any single-argument constructor to identify it as eligible for implicit conversion. For example: ```mojo struct MyInt: var value: Int @implicit fn __init__(out self, value: Int): self.value = value fn __init__(out self, value: Float64): self.value = Int(value) ``` This implicit conversion constructor allows you to pass an `Int` to a function that takes a `MyInt` argument, or assign an `Int` to a variable of type `MyInt`. 
However, the constructor that takes a `Float64` value is **not** an implicit conversion constructor, so it must be invoked explicitly: ```mojo fn func(n: MyInt): print("MyInt value: ", n.value) fn main(): func(Int(42)) # Implicit conversion from Int: OK func(MyInt(Float64(4.2))) # Explicit conversion from Float64: OK func(Float64(4.2)) # Error: can't convert Float64 to MyInt ``` --- ## @nonmaterializable You can add the `@nonmaterializable` decorator on a struct to declare that the type can exist only in the parameter domain (it can be used for metaprogramming only, and not as a runtime type). And, if an instance of this type does transition into the runtime domain, this decorator declares what type it becomes there. To use it, declare your type with `@nonmaterializable(TargetType)`, where `TargetType` is the type that the object should convert to if it becomes a runtime value (you must declare the `TargetType`). For example, if a struct is marked as `@nonmaterializable(Foo)`, then anywhere that it goes from a parameter value to a runtime value, it automatically converts into the `Foo` type. For example, the following `NmStruct` type can be used in the parameter domain, but the `converted_to_has_bool` instance of it is converted to `HasBool` when it's materialized as a runtime value: ```mojo @value @register_passable("trivial") struct HasBool: var x: Bool fn __init__(out self, x: Bool): self.x = x @always_inline("nodebug") fn __init__(out self, nms: NmStruct): self.x = True if (nms.x == 77) else False @value @nonmaterializable(HasBool) @register_passable("trivial") struct NmStruct: var x: Int @always_inline("nodebug") fn __add__(self, rhs: Self) -> Self: return NmStruct(self.x + rhs.x) alias still_nm_struct = NmStruct(1) + NmStruct(2) # When materializing to a run-time variable, it is automatically converted, # even without a type annotation. var converted_to_has_bool = still_nm_struct ``` :::note A non-materializable struct must have all of its methods annotated as `@always_inline`, and it must be computable in the parameter domain. ::: --- ## @parameter You can add the `@parameter` decorator on an `if` or `for` statement to run that code at compile time, or on a nested function to create a [parametric closure](#parametric-closure). ## Parametric `if` statement You can add `@parameter` to any `if` condition that's based on a valid parameter expression (it's an expression that evaluates at compile time). This ensures that only the live branch of the `if` statement is compiled into the program, which can reduce your final binary size. For example: ```mojo @parameter if True: print("this will be included in the binary") else: print("this will be eliminated at compile time") ``` ```output this will be included in the binary ``` ## Parametric `for` statement You can add the `@parameter` decorator to a `for` loop to create a loop that's "unrolled" at compile time. The loop sequence and induction values must be valid parameter expressions (that is, expressions that evaluate at compile time). For example, if you use `for i in range(LIMIT)`, the expression `range(LIMIT)` defines the loop sequence. This is a valid parameter expression if `LIMIT` is a parameter, alias, or integer literal. The compiler "unrolls" the loop by replacing the `for` loop with `LIMIT` copies of the loop body with different constant `i` values. You can use run-time expressions in the body of the loop (for example, in the following example, the `list`, `threshold`, and `count` variables are all run-time values). 
```mojo
from random import rand

def main():
    alias LIST_SIZE = 128
    var list = List[Float64](length=LIST_SIZE, fill=0)
    rand(list.unsafe_ptr(), LIST_SIZE)

    var threshold = 0.6
    var count = 0

    @parameter
    for i in range(LIST_SIZE):
        if (list[i] > threshold):
            count += 1

    print(StaticString("{} items over 0.6").format(count))
```

The `@parameter for` construct unrolls at the beginning of compilation, which might explode the size of the program that still needs to be compiled, depending on the amount of code that's unrolled.

Currently, `@parameter for` requires the sequence's `__iter__` method to return a `_StridedRangeIterator`, meaning the induction variables must be `Int`. The intention is to lift this restriction in the future.

## Parametric closure

You can add `@parameter` on a nested function to create a "parametric" capturing closure. This means you can create a closure function that captures values from the outer scope (regardless of whether they are variables or parameters), and then use that closure as a parameter. For example:

```mojo
fn use_closure[func: fn(Int) capturing [_] -> Int](num: Int) -> Int:
    return func(num)

fn create_closure():
    var x = 1

    @parameter
    fn add(i: Int) -> Int:
        return x + i

    var y = use_closure[add](2)
    print(y)

create_closure()
```

```output
3
```

Without the `@parameter` decorator, you'll get a compiler error that says you "cannot use a dynamic value in call parameter"—referring to the `use_closure[add](2)` call—because the `add()` closure would still be dynamic.

Note the `[_]` in the function type:

```mojo
fn use_closure[func: fn(Int) capturing [_] -> Int](num: Int) -> Int:
```

This origin specifier represents the set of origins for the values captured by the parametric closure. This allows the compiler to correctly extend the lifetimes of those values. For more information on lifetimes and origins, see [Lifetimes, origins and references](/mojo/manual/values/lifetimes).

---

## @register_passable

You can add the `@register_passable` decorator on a struct to tell Mojo that the type should be passed in machine registers (such as a CPU register; subject to the details of the underlying architecture). For tiny data types like an integer or floating-point number, this is much more efficient than storing values in stack memory. This means the type is always passed by value and cannot be passed by reference.

The basic `@register_passable` decorator does not change the fundamental behavior of a type: it still needs an `__init__()` and `__copyinit__()` method to be copyable (and it may have a `__del__()` method, if necessary). For example:

```mojo
@register_passable
struct Pair:
    var a: Int
    var b: Int

    fn __init__(out self, one: Int, two: Int):
        self.a = one
        self.b = two

    fn __copyinit__(out self, existing: Self):
        self.a = existing.a
        self.b = existing.b

fn test_pair():
    var x = Pair(5, 10)
    var y = x
    print(y.a, y.b)
    y.a = 10
    y.b = 20
    print(y.a, y.b)
```

```mojo
test_pair()
```

```output
5 10
10 20
```

This behavior is what we expect from `Pair`, with or without the decorator.

You should be aware of a few other observable effects:

1. `@register_passable` types cannot hold instances of types that are not also `@register_passable`.
2. `@register_passable` types do not have a predictable identity, and so the `self` pointer is not stable/predictable (e.g. in hash tables).
3. `@register_passable` arguments and results are exposed to C and C++ directly, instead of being passed by pointer.
4. `@register_passable` types cannot have a [`__moveinit__()` constructor](/mojo/manual/lifecycle/life#move-constructor), because values passed in a register cannot be passed by reference.

## `@register_passable("trivial")`

Most types that use `@register_passable` are just "bags of bits," which we call "trivial" types. These trivial types are simple and should be copied, moved, and destroyed without any custom constructors or a destructor. For these types, you can add the `"trivial"` argument, and Mojo synthesizes all the lifecycle methods as appropriate for a trivial register-passable type:

```mojo
@register_passable("trivial")
struct Pair:
    var a: Int
    var b: Int
```

This is similar to the [`@value`](/mojo/manual/decorators/value) decorator, except when using `@register_passable("trivial")` the only lifecycle method you're allowed to define is the `__init__()` constructor (but you don't have to)—you *cannot* define any copy or move constructors or a destructor.

Examples of trivial types include:

* Arithmetic types such as `Int`, `Bool`, `Float64` etc.
* Pointers (the address value is trivial, not the data being pointed to).
* Arrays of other trivial types, including SIMD.

For more information about lifecycle methods (constructors and destructors) see the section about [Value lifecycle](/mojo/manual/lifecycle/).

:::note TODO
This decorator is due for reconsideration. Lack of custom copy/move/destroy logic and "passability in a register" are orthogonal concerns and should be split. The former logic should be subsumed into a more general decorator, which is orthogonal to `@register_passable`.
:::

---

## @staticmethod

You can add the `@staticmethod` decorator on a struct method to declare a static method. For example:

```mojo
from collections import List
from pathlib import Path

struct MyStruct:
    var data: List[UInt8]

    fn __init__(out self):
        self.data = List[UInt8]()

    fn __moveinit__(out self, owned existing: Self):
        self.data = existing.data^

    @staticmethod
    fn load_from_file(file_path: Path) raises -> Self:
        var new_struct = MyStruct()
        new_struct.data = file_path.read_bytes()
        return new_struct^
```

Unlike an instance method, a static method doesn't take an implicit `self` argument. It's not attached to a specific instance of a struct, so it can't access instance data.

For more information see the documentation on [static methods](/mojo/manual/structs#static-methods).

---

## @value

You can add the `@value` decorator on a struct to generate boilerplate lifecycle methods, including the member-wise `__init__()` constructor, `__copyinit__()` copy constructor, and `__moveinit__()` move constructor.

For example, consider a simple struct like this:

```mojo
@value
struct MyPet:
    var name: String
    var age: Int
```

Mojo sees the `@value` decorator, notices that you haven't defined any constructors, and synthesizes them for you. The result is as if you had actually written this:

```mojo
struct MyPet:
    var name: String
    var age: Int

    fn __init__(out self, owned name: String, age: Int):
        self.name = name^
        self.age = age

    fn __copyinit__(out self, existing: Self):
        self.name = existing.name
        self.age = existing.age

    fn __moveinit__(out self, owned existing: Self):
        self.name = existing.name^
        self.age = existing.age
```

Mojo synthesizes each lifecycle method only when it doesn't exist, so you can use `@value` and still define your own versions to override the default behavior. For example, it is fairly common to use the default member-wise and move constructors, but create a custom copy constructor.
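As a minimal sketch of that pattern (the `print()` call is just illustrative), the struct below keeps the synthesized member-wise `__init__()` and `__moveinit__()` but supplies its own `__copyinit__()`:

```mojo
@value
struct MyPet:
    var name: String
    var age: Int

    # Custom copy constructor; @value still synthesizes the
    # member-wise __init__() and the __moveinit__() move constructor.
    fn __copyinit__(out self, existing: Self):
        print("copying:", existing.name)
        self.name = existing.name
        self.age = existing.age
```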
For more information about these lifecycle methods, read [Life of a value](/mojo/manual/lifecycle/life).

---

## A step-by-step guide to Magic

This guide will walk you through Magic, a command-line-based package management tool by Modular designed for fast, efficient, and scalable project management in Mojo and MAX environments. Whether you're managing dependencies across multiple platforms, setting up environments for specific tasks, or working with Python-based projects, the `magic` CLI simplifies these processes and more. Built on the powerful [Pixi](https://prefix.dev/), Magic leverages its capabilities to provide seamless environment management and package handling for MAX and Mojo applications.

:::note
We recommend using Magic when developing in Mojo. If you're using the MAX framework with Python, we recommend installing our APIs and CLI tools via `pip` inside a Python virtual environment. For more details, see the [MAX install guide](/max/packages).
:::

In this tutorial, we'll guide you through everything from setting up your first project and understanding the `magic` CLI to running Mojo code and building a FastAPI application. By the end, you'll have a solid grasp of how to configure and use Magic effectively for streamlined project management, dependency handling, and environment setup.

## Step 1: Create a project

The first step to using Magic is creating a new project. Run the following command:

```sh
magic init hello-magic --format mojoproject
```

:::tip Documentation
For detailed command options and examples, run `magic --help` in your terminal or explore the [magic commands reference](/magic/commands).
:::

Then navigate to the project directory:

```sh
cd hello-magic
```

Your project structure will look like this:

```txt
├── .gitattributes
├── .gitignore
├── .magic
├── magic.lock
└── mojoproject.toml
```

- The `.magic` directory is used to store environment configurations and manage the dependencies for your project. This helps Magic keep your project isolated and ensures that different versions of dependencies won't conflict with other projects. The `.magic/envs` sub-directory specifically stores the virtual environments for your project. Unlike other package managers, Magic keeps your environment separate and clean, making it easy to manage and switch between different projects.
- `mojoproject.toml` is a single TOML configuration file.
- `magic.lock` is a critical file for ensuring reproducibility. It captures the exact versions of every dependency in your project. This ensures that when you or someone else runs your project in the future or on a different machine, Magic will install the exact same versions of packages. This avoids the common "it works on my machine" issue, providing consistency, especially in complex projects across different platforms. For more details, see the [Pixi lockfile](https://pixi.sh/latest/features/lockfile/) documentation.

:::note Magic caches dependencies
Magic uses a system-wide cache for packages, so creating extra environments does not take up more disk space.
:::

### Inspect `mojoproject.toml`

The `mojoproject.toml` file defines your project's configuration. It contains sections like project metadata, dependencies, and channels, all in a single TOML file.
Here's an example:

```txt
[project]
authors = ["Modular"]
channels = ["https://conda.modular.com/max-nightly", "https://conda.modular.com/max", "https://repo.prefix.dev/modular-community", "conda-forge"]
description = "Add a short description here"
name = "hello-magic"
platforms = ["osx-arm64"]
version = "0.1.0"

[dependencies]
max = ">=24.4.0"
python = ">=3.8"
```

To run Mojo code that calls into Python, create a `local/zero.mojo` file that imports PyTorch through Mojo's Python interop:

```mojo
from python import Python, PythonObject

def zero() -> PythonObject:
    torch = Python.import_module("torch")
    return torch.zeros(1)

def main():
    print(zero())
```

Now, include the following in `main.py`:

```python
import subprocess

from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/zero")
def zero():
    try:
        p = subprocess.Popen(
            ["magic", "run", "mojo", "local/zero.mojo"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        while True:
            output = p.stdout.readline()
            if output == "" and p.poll() is not None:
                raise HTTPException(
                    status_code=500, detail="Failed to produce zero"
                )
            return {"message": f"answer is {output}"}
    except subprocess.SubprocessError as e:
        raise HTTPException(status_code=500, detail="Failed to execute subprocess")
```

Then invoke `magic run dev-server` and navigate to the `/zero` endpoint at [http://127.0.0.1:8000/zero](http://127.0.0.1:8000/zero). You should see:

```
{"message":"answer is tensor([0.])\n"}
```

Above, we are running `magic run mojo local/zero.mojo` from a subprocess. Another way is to first build the binary ahead of time with:

```bash
cd local && magic run mojo build zero.mojo
```

then navigate back to the top of the repository and, at runtime, run the built binary instead:

```bash
magic run bash -c local/zero
```

The latter runs an ahead-of-time compiled binary, whereas `mojo local/zero.mojo` uses the just-in-time (JIT) compilation feature of the Mojo compiler.

## Step 6: Setup a test environment

Testing is crucial in development, especially when dealing with complex dependencies. Using Magic, you can set up a dedicated testing environment that isolates your testing dependencies from your development dependencies. This ensures that your development environment remains clean and focused, while your test environment has everything it needs to run unit tests, integration tests, etc. Isolating these environments also helps prevent any accidental conflicts or issues during testing.

To add testing dependencies in a dedicated environment using Magic, first run:

```sh
magic task add test "pytest" --feature test
```

which adds the following:

```txt
[feature.test.tasks]
test = "pytest"
```

Then we need to explicitly add the `default` environment:

```sh
magic project environment add default --solve-group default
```

After that, we can include the `test` environment via:

```sh
magic project environment add test --feature test --solve-group default
```

This adds the following configuration:

```
[environments]
default = { solve-group = "default" }
test = { features = ["test"], solve-group = "default" }
```

:::note Group dependencies
Here `--solve-group` is a way to group dependencies together, which is useful when multiple environments share the same dependencies. Check out [Pixi's multi-environment](https://pixi.sh/latest/features/multi_environment/) documentation for more.
:::

Finally, add `pytest` as a dependency for the test environment via `--feature`:

```sh
magic add pytest --pypi --feature test
```

which includes the following in `mojoproject.toml`:

```txt
[feature.test.pypi-dependencies]
pytest = ">=8.3.2"
```

> Report feedback, including issues, on our [MAX](https://github.com/modular/modular/issues) GitHub tracker.
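To exercise the new test environment, a minimal smoke test might look like the following sketch. The file name `tests/test_zero.py` is an assumption; the test reuses the `magic run mojo local/zero.mojo` command from the tutorial:

```python
# tests/test_zero.py -- hypothetical smoke test for the hello-magic project.
import subprocess


def test_zero_outputs_tensor():
    # Run the same Mojo program that the /zero endpoint shells out to.
    result = subprocess.run(
        ["magic", "run", "mojo", "local/zero.mojo"],
        capture_output=True,
        text=True,
        check=True,
    )
    assert "tensor([0.])" in result.stdout
```

You can then run the `test` task (which invokes `pytest`) inside the `test` environment; Magic follows Pixi's convention of selecting an environment with `-e` (for example, `magic run -e test test`), but check `magic run --help` for the exact flag on your version.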
---

## abort

`abort[result: AnyType = None]() -> result`

Calls a target-dependent trap instruction if available.

**Parameters:**

* result (`AnyType`): The result type.

**Returns:**

A null result type.

`abort[result: AnyType = None](message: String) -> result`

Calls a target-dependent trap instruction if available.

**Parameters:**

* result (`AnyType`): The result type.

**Args:**

* message (`String`): The message to include when aborting.

**Returns:**

A null result type.

---

## abs

`abs(t: IntTuple[origin]) -> IntTuple`

Compute the absolute value of each element in an `IntTuple`. This function applies the absolute value operation to each integer in a potentially nested `IntTuple` structure.

**Args:**

* t (`IntTuple[origin]`): The `IntTuple` to transform.

**Returns:**

A new `IntTuple` with the same structure but with absolute values.

---

## abs

`abs[T: Absable](value: T) -> T`

Get the absolute value of the given object.

**Parameters:**

* T (`Absable`): The type conforming to Absable.

**Args:**

* value (`T`): The object to get the absolute value of.

**Returns:**

The absolute value of the object.

---

## abs

`abs(x: ComplexSIMD[type, size]) -> SIMD[type, size]`

Performs elementwise abs (norm) on each element of the complex value.

**Args:**

* x (`ComplexSIMD[type, size]`): The complex vector to perform absolute value on.

**Returns:**

The elementwise abs of x.

---

## Absable

The `Absable` trait describes a type that defines an absolute value operation. Types that conform to `Absable` will work with the builtin `abs` function. The absolute value operation always returns the same type as the input. For example:

```mojo
from math import sqrt

@value
struct Point(Absable):
    var x: Float64
    var y: Float64

    # Return the magnitude in x, with y zeroed, so the
    # result is still a Point.
    fn __abs__(self) -> Self:
        return Self(sqrt(self.x * self.x + self.y * self.y), 0)
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__abs__`

`__abs__(self: _Self) -> _Self`

Get the absolute value of this instance.

**Returns:**

The absolute value of the instance.

---

## AccessPolicyWindow

`@register_passable(trivial)`

`struct AccessPolicyWindow`

Specifies an access policy for a window of memory. This struct defines a contiguous extent of memory beginning at base\_ptr and ending at base\_ptr + num\_bytes, with associated access policies. It allows fine-grained control over how memory is accessed and cached, which can significantly impact performance for memory-bound workloads. The window is partitioned into segments with different access properties based on the hit\_ratio. Accesses to "hit segments" use the hit\_prop policy, while accesses to "miss segments" use the miss\_prop policy.

Note: The `num_bytes` value is limited by `CU_DEVICE_ATTRIBUTE_MAX_ACCESS_POLICY_WINDOW_SIZE`. The CUDA driver may align the `base_ptr` and restrict the maximum size.

## Fields

* base\_ptr (`UnsafePointer[NoneType]`): Starting address of the access policy window. Driver may align it.
* num\_bytes (`Int`): Size in bytes of the window policy. CUDA driver may restrict the maximum size and alignment.
* hit\_ratio (`SIMD[float32, 1]`): Specifies the percentage of lines assigned hit\_prop; the rest are assigned miss\_prop. Value should be between 0.0 and 1.0.
* hit\_prop (`AccessProperty`): AccessProperty applied to hit segments within the window.
* miss\_prop (`AccessProperty`): AccessProperty applied to miss segments within the window. Must be either NORMAL or STREAMING.
## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Initializes a new AccessPolicyWindow with default values. `__init__[T: AnyType](*, base_ptr: UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin], count: Int, hit_ratio: SIMD[float32, 1], hit_prop: AccessProperty = AccessProperty(__init__[__mlir_type.!pop.int_literal](0)), miss_prop: AccessProperty = AccessProperty(__init__[__mlir_type.!pop.int_literal](0))) -> Self` Initializes an `AccessPolicyWindow` for a typed memory region. **Parameters:** * ​T (`AnyType`): The type of data in the memory region. **Args:** * ​base\_ptr (`UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the start of the memory region. * ​count (`Int`): Number of elements of type T in the memory region. * ​hit\_ratio (`SIMD[float32, 1]`): Fraction of the window that should use hit\_prop (0.0 to 1.0). * ​hit\_prop (`AccessProperty`): Access property for hit segments (default: NORMAL). * ​miss\_prop (`AccessProperty`): Access property for miss segments (default: NORMAL). ### `__str__` `__str__(self) -> String` Returns a string representation of the `AccessPolicyWindow`. **Returns:** A string representation of the `AccessPolicyWindow`. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of the `AccessPolicyWindow` to a writer. This method formats all the fields of the AccessPolicyWindow into a human-readable string representation and writes it to the provided writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The writer instance to write the formatted string to. --- ## AccessProperty `@register_passable(trivial)` `struct AccessProperty` Specifies performance hint with AccessPolicyWindow for hit\_prop and miss\_prop fields. This struct defines cache persistence properties that can be used with `AccessPolicyWindow` to control how data is cached during GPU memory accesses. It provides hints to the memory subsystem about the expected access patterns, which can improve performance for specific workloads. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `NORMAL` `alias NORMAL = AccessProperty(__init__[__mlir_type.!pop.int_literal](0))` Normal cache persistence with default caching behavior. ### `PERSISTING` `alias PERSISTING = AccessProperty(__init__[__mlir_type.!pop.int_literal](2))` Persisting access is more likely to persist in cache, optimized for reused data. ### `STREAMING` `alias STREAMING = AccessProperty(__init__[__mlir_type.!pop.int_literal](1))` Streaming access is less likely to persist in cache, optimized for single-use data. ## Methods ### `__init__` `__init__(*, other: Self) -> Self` Explicitly construct a deep copy of the provided value. **Args:** * ​other (`Self`): The value to copy. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two `AccessProperty` instances for equality. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have the same value, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two `AccessProperty` instances for inequality. **Args:** * ​other (`Self`): The `AccessProperty` to compare with. **Returns:** True if the instances have different values, False otherwise. 
### `__is__`

`__is__(self, other: Self) -> Bool`

Checks if two `AccessProperty` instances have the same value.

**Args:**

* other (`Self`): The `AccessProperty` to compare with.

**Returns:**

True if the instances have the same value, False otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

Checks if two `AccessProperty` instances have different values.

**Args:**

* other (`Self`): The `AccessProperty` to compare with.

**Returns:**

True if the instances have different values, False otherwise.

### `__str__`

`__str__(self) -> String`

Returns a string representation of the `AccessProperty`.

**Returns:**

A string representation of the `AccessProperty`.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes a string representation of the `AccessProperty` to a writer.

**Parameters:**

* W (`Writer`): The type of writer to use for output. Must implement the Writer trait.

**Args:**

* writer (`W`): The writer instance to write the formatted string to.

---

## accumulate

---

## accumulate_wo_tile

`accumulate_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](c_tile_size: Int, output: UnsafePointer[SIMD[output_dt, 1]], output_stride: Int, input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, partial_load_size: Int)`

---

## accumulate_wo_tile_1d

`accumulate_wo_tile_1d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, S: Int, mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: Int, filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: Int, partial_load_filter_size: Int, w: Int, W: Int, dilation: Int)`

Update one row in the output for a given (c, f) tile.

**Parameters:**

* micro\_kernel\_height (`Int`): Number of input points in register tiling.
* micro\_kernel\_width (`Int`): Number of SIMD registers assigned to F.
* simd\_size (`Int`): Number of elements in a SIMD register.
* partial\_load\_filter (`Bool`): Whether to use a partial load for the filter.
* effected\_by\_padding (`Bool`): Whether the tile is affected by padding.
* input\_dt (`DType`): DType of input.
* filter\_dt (`DType`): DType of filter.

**Args:**

* c\_tile\_size (`Int`): Tile size in input channel.
* S (`Int`): Filter window width.
* acc (`_Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop]`): Pointer to register tile accumulator.
* input (`UnsafePointer[SIMD[input_dt, 1]]`): Pointer to the first input point in the WO tile.
* input\_stride (`Int`): Stride between two input points, i.e., C w/ NHWC layout.
* input\_stride\_to\_nbr (`Int`): Stride between an input point and its neighbor.
* filter (`UnsafePointer[SIMD[filter_dt, 1]]`): Pointer to the first coefficient in the filter window.
* filter\_stride (`Int`): Stride between two segments of size `micro_kernel_width * simd_size`.
* filter\_stride\_to\_nbr (`Int`): Stride between two neighboring coefficients, i.e., CF w/ RSCF layout.
* partial\_load\_filter\_size (`Int`): Size of partial load for the filter.
* w (`Int`): Coordinate in an input row.
* W (`Int`): Input width.
* dilation (`Int`): Convolution dilation.
--- ## accumulate_wo_tile_2d `accumulate_wo_tile_2d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, RS: IndexList[2], mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: IndexList[2], filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: IndexList[2], partial_load_filter_size: Int, hw: IndexList[2], HW: IndexList[2], dilation: IndexList[2])` --- ## accumulate_wo_tile_3d `accumulate_wo_tile_3d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, partial_load_filter: Bool, effected_by_padding: Bool, input_dt: DType, filter_dt: DType](c_tile_size: Int, QRS: IndexList[3], mut acc: _Accumulator[type, num_rows, num_cols, simd_width, row_start, row_stop], input: UnsafePointer[SIMD[input_dt, 1]], input_stride: Int, input_stride_to_nbr: IndexList[3], filter: UnsafePointer[SIMD[filter_dt, 1]], filter_stride: Int, filter_stride_to_nbr: IndexList[3], partial_load_filter_size: Int, dhw: IndexList[3], DHW: IndexList[3], dilation: IndexList[3])` --- ## acos `acos[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `acos` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `acos` of the input. --- ## acosh `acosh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `acosh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `acosh` of the input. --- ## activations The module contains implementations of activation functions. ## Functions * [​`elu`](./elu): Compute the Elu Op using the equation $z if z >= 0 else alpha*(e^z -1)$. * [​`gelu`](./gelu): Compute the GELU Op using the equation $0.5 * x * (1 + erf(x / sqrt(2)))$. * [​`gelu_approximate`](./gelu_approximate): Compute the approximate GELU Op using the equation $0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))$. * [​`relu`](./relu): Compute the Relu Op using the equation $max(0, x)$. * [​`relu_n1`](./relu_n1): Compute the Relu N1 Op using the equation $max(min(x,1),-1)$. * [​`sign`](./sign): Compute the sign (0, 1) of the input value. --- ## AddressSpace `@register_passable(trivial)` `struct AddressSpace` Address space of the pointer. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `GENERIC` `alias GENERIC = AddressSpace(0)` Generic address space. ## Methods ### `__init__` `__init__(value: Int) -> Self` Initializes the address space from the underlying integral value. **Args:** * ​value (`Int`): The address space value. `__init__(value: _GPUAddressSpace) -> Self` Initializes the address space from the underlying integral value. **Args:** * ​value (`_GPUAddressSpace`): The address space value. 
### `__eq__`

`__eq__(self, other: Self) -> Bool`

True if the two address spaces are equal and False otherwise.

**Args:**

* other (`Self`): The other address space value.

**Returns:**

True if the two address spaces are equal and False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

True if the two address spaces are not equal and False otherwise.

**Args:**

* other (`Self`): The other address space value.

**Returns:**

True if the two address spaces are not equal and False otherwise.

### `__is__`

`__is__(self, other: Self) -> Bool`

True if the two address spaces are equal and False otherwise.

**Args:**

* other (`Self`): The other address space value.

**Returns:**

True if the two address spaces are equal and False otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

True if the two address spaces are not equal and False otherwise.

**Args:**

* other (`Self`): The other address space value.

**Returns:**

True if the two address spaces are not equal and False otherwise.

### `value`

`value(self) -> Int`

The integral value of the address space.

**Returns:**

The integral value of the address space.

### `__int__`

`__int__(self) -> Int`

The integral value of the address space.

**Returns:**

The integral value of the address space.

### `__index__`

`__index__(self) -> index`

Convert to index.

**Returns:**

The corresponding `__mlir_type.index` value.

### `__str__`

`__str__(self) -> String`

Gets a string representation of the AddressSpace.

**Returns:**

The string representation of the AddressSpace.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats the address space to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the `Writer` trait.

**Args:**

* writer (`W`): The object to write to.

---

## advanced_indexing_getitem

`advanced_indexing_getitem[input_rank: Int, index_rank: Int, input_type: DType, index_type: DType, //, start_axis: Int, num_index_tensors: Int, target: StringSlice[StaticConstantOrigin], single_thread_blocking_override: Bool, trace_description: StringSlice[StaticConstantOrigin], input_tensor_fn: fn[Int](IndexList[input_rank]) capturing -> SIMD[input_type, $0], indices_fn: fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]](out_tensor: NDBuffer[input_type, ((num_index_tensors * -1) + index_rank + input_rank), origin], in_tensor_strides: IndexList[input_rank], ctx: DeviceContextPtr)`

Implement basic numpy-style advanced indexing.

This is designed to be fused with other view-producing operations to implement full numpy-indexing semantics.

This assumes the dimensions in `input_tensor` not indexed by index tensors are ":", i.e., selecting all indices along the slice. For example in numpy:

```
# rank(indices1) == 3
# rank(indices2) == 3
out_tensor = input_tensor[:, :, :, indices1, indices2, :, :]
```

We calculate the following for all valid values of the indexing variables:

```
out_tensor[a, b, c, i, j, k, d, e] = input_tensor[
    a, b, c,
    indices1[i, j, k],
    indices2[i, j, k],
    d, e
]
```

In this example `start_axis = 3` and `num_index_tensors = 2`.

TODO(GEX-1951): Support boolean tensor masks. TODO(GEX-1952): Support the non-contiguous indexing tensor case. TODO(GEX-1953): Support fusion (especially view-fusion).

**Parameters:**

* input\_rank (`Int`): The rank of the input tensor.
* index\_rank (`Int`): The rank of the indexing tensors.
* input\_type (`DType`): The dtype of the input tensor.
* index\_type (`DType`): The dtype of the indexing tensors.
* start\_axis (`Int`): The first dimension in the input where the indexing tensors are applied. It is assumed the indexing tensors are applied in consecutive dimensions.
* num\_index\_tensors (`Int`): The number of indexing tensors.
* target (`StringSlice[StaticConstantOrigin]`): The target architecture to operate on.
* single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.
* trace\_description (`StringSlice[StaticConstantOrigin]`): For profiling, the trace name the operation will appear under.
* input\_tensor\_fn (`fn[Int](IndexList[input_rank]) capturing -> SIMD[input_type, $0]`): Fusion lambda for the input tensor.
* indices\_fn (`fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]`): Fusion lambda for the indices tensors.

**Args:**

* out\_tensor (`NDBuffer[input_type, ((num_index_tensors * -1) + index_rank + input_rank), origin]`): The output tensor to write to.
* in\_tensor\_strides (`IndexList[input_rank]`): The strides of the input tensor.
* ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler.

---

## advanced_indexing_getitem_shape

`advanced_indexing_getitem_shape[input_rank: Int, index_rank: Int, //, start_axis: Int, num_index_tensors: Int](input_shape: IndexList[input_rank], index_shape: IndexList[index_rank]) -> IndexList[((num_index_tensors * -1) + index_rank + input_rank)]`

Calculate the output shape from advanced indexing.

**Parameters:**

* input\_rank (`Int`): The rank of the input tensor.
* index\_rank (`Int`): The rank of the indexing tensors.
* start\_axis (`Int`): The first dimension in the input where the indexing tensors are applied. It is assumed the indexing tensors are applied in consecutive dimensions.
* num\_index\_tensors (`Int`): The number of indexing tensors.

**Args:**

* input\_shape (`IndexList[input_rank]`): The shape of the input tensor in the operation.
* index\_shape (`IndexList[index_rank]`): The shape of the indexing tensors in the operation.

---

## advanced_indexing_setitem_inplace

`advanced_indexing_setitem_inplace[input_rank: Int, index_rank: Int, updates_rank: Int, input_type: DType, index_type: DType, //, start_axis: Int, num_index_tensors: Int, target: StringSlice[StaticConstantOrigin], single_thread_blocking_override: Bool, trace_description: StringSlice[StaticConstantOrigin], updates_tensor_fn: fn[Int](IndexList[updates_rank]) capturing -> SIMD[input_type, $0], indices_fn: fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]](input_tensor: NDBuffer[input_type, input_rank, origin], index_tensor_shape: IndexList[index_rank, element_type=element_type], updates_tensor_strides: IndexList[updates_rank], ctx: DeviceContextPtr)`

Implement basic numpy-style advanced indexing with assignment.

This is designed to be fused with other view-producing operations to implement full numpy-indexing semantics.

This assumes the dimensions in `input_tensor` not indexed by index tensors are ":", i.e., selecting all indices along the slice. For example in numpy:

```
# rank(indices1) == 2
# rank(indices2) == 2
# rank(updates) == 2
input_tensor[:, :, :, indices1, indices2, :, :] = updates
```

We calculate the following for all valid values of the indexing variables:

```
input_tensor[
    a, b, c,
    indices1[i, j],
    indices2[i, j],
    d, e
] = updates[i, j]
```

In this example `start_axis = 3` and `num_index_tensors = 2`.

In terms of implementation details, our strategy is to iterate over all indices over a common iteration range.
The idea is that we can map indices in this range to the write location in `input_tensor` as well as the data location in `updates`. An example illustrates this best:

Imagine the `input_tensor` shape is [A, B, C, D] and we have indexing tensors I1 and I2 with shape [M, N, K]. Assume I1 and I2 are applied to dimensions 1 and 2. I claim an appropriate common iteration range is then (A, M, N, K, D). Note we expect `updates` to have the shape [A, M, N, K, D]. We will show this by providing the mappings into `updates` and `input_tensor`. Consider an arbitrary set of indices in this range (a, m, n, k, d):

- The index into `updates` is (a, m, n, k, d).
- The index into `input_tensor` is (a, I1[m, n, k], I2[m, n, k], d).

TODO(GEX-1951): Support boolean tensor masks. TODO(GEX-1952): Support the non-contiguous indexing tensor case. TODO(GEX-1953): Support fusion (especially view-fusion). TODO(GEX-1954): Unify getitem and setitem using generic views. (Requires non-strided view functions.)

**Parameters:**

* input\_rank (`Int`): The rank of the input tensor.
* index\_rank (`Int`): The rank of the indexing tensors.
* updates\_rank (`Int`): The rank of the updates tensor.
* input\_type (`DType`): The dtype of the input tensor.
* index\_type (`DType`): The dtype of the indexing tensors.
* start\_axis (`Int`): The first dimension in the input where the indexing tensors are applied. It is assumed the indexing tensors are applied in consecutive dimensions.
* num\_index\_tensors (`Int`): The number of indexing tensors.
* target (`StringSlice[StaticConstantOrigin]`): The target architecture to operate on.
* single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.
* trace\_description (`StringSlice[StaticConstantOrigin]`): For profiling, the trace name the operation will appear under.
* updates\_tensor\_fn (`fn[Int](IndexList[updates_rank]) capturing -> SIMD[input_type, $0]`): Fusion lambda for the update tensor.
* indices\_fn (`fn[Int](IndexList[index_rank]) capturing -> SIMD[index_type, 1]`): Fusion lambda for the indices tensors.

**Args:**

* input\_tensor (`NDBuffer[input_type, input_rank, origin]`): The input tensor being indexed into and modified in-place.
* index\_tensor\_shape (`IndexList[index_rank, element_type=element_type]`): The shape of each index tensor.
* updates\_tensor\_strides (`IndexList[updates_rank]`): The strides of the update tensor.
* ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler.

---

## AI glossary

---

## algorithm

Implements the algorithm package.

## Modules

* [`functional`](/mojo/stdlib/algorithm/functional/): Implements higher-order functions.
* [`memory`](/mojo/stdlib/algorithm/memory/): Implements `parallel_memcpy`.
* [`reduction`](/mojo/stdlib/algorithm/reduction/): Implements SIMD reductions.

---

## AlibiScoreMod

`@register_passable(trivial)`

`struct AlibiScoreMod[num_heads: Int]`

AlibiScoreMod adds the appropriate ALiBi constant bias to the attention score.
## Implemented traits

`AnyType`, `Copyable`, `Movable`, `ScoreModTrait`, `UnknownDestructibility`

## Aliases

### `name_str`

`alias name_str = __init__[__mlir_type.!kgen.string]("alibi")`

## Methods

### `score_mod`

`score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int) -> SIMD[type, width]`

---

## align_down

`align_down(value: Int, alignment: Int) -> Int`

Returns the closest multiple of alignment that is less than or equal to value.

**Args:**

* value (`Int`): The value to align.
* alignment (`Int`): Value to align to.

**Returns:**

Closest multiple of the alignment that is less than or equal to the input value. In other words, floor(value / alignment) \* alignment.

`align_down(value: UInt, alignment: UInt) -> UInt`

Returns the closest multiple of alignment that is less than or equal to value.

**Args:**

* value (`UInt`): The value to align.
* alignment (`UInt`): Value to align to.

**Returns:**

Closest multiple of the alignment that is less than or equal to the input value. In other words, floor(value / alignment) \* alignment.

---

## align_down_residual

`align_down_residual(value: Int, alignment: Int) -> Int`

Returns the remainder after aligning down value to alignment.

**Args:**

* value (`Int`): The value to align.
* alignment (`Int`): Value to align to.

**Returns:**

The remainder after aligning down value to the closest multiple of alignment. In other words, value - align\_down(value, alignment).

---

## align_up

`align_up(value: Int, alignment: Int) -> Int`

Returns the closest multiple of alignment that is greater than or equal to value.

**Args:**

* value (`Int`): The value to align.
* alignment (`Int`): Value to align to.

**Returns:**

Closest multiple of the alignment that is greater than or equal to the input value. In other words, ceiling(value / alignment) \* alignment.

`align_up(value: UInt, alignment: UInt) -> UInt`

Returns the closest multiple of alignment that is greater than or equal to value.

**Args:**

* value (`UInt`): The value to align.
* alignment (`UInt`): Value to align to.

**Returns:**

Closest multiple of the alignment that is greater than or equal to the input value. In other words, ceiling(value / alignment) \* alignment.

---

## alignof

`alignof[type: AnyType, target: target = _current_target()]() -> Int`

Returns the alignment (in bytes) of the type.

**Parameters:**

* type (`AnyType`): The type in question.
* target (`target`): The target architecture.

**Returns:**

The alignment of the type in bytes.

`alignof[dtype: DType, target: target = _current_target()]() -> Int`

Returns the alignment (in bytes) of the dtype.

**Parameters:**

* dtype (`DType`): The DType in question.
* target (`target`): The target architecture.

**Returns:**

The alignment of the dtype in bytes.

---

## all

`all[T: Boolable & Copyable & Movable, //](list: List[T, hint_trivial_type]) -> Bool`

Checks if **all** elements in the list are truthy.

**Parameters:**

* T (`Boolable & Copyable & Movable`): The type of elements to check.

**Args:**

* list (`List[T, hint_trivial_type]`): The list to check.

**Returns:**

`True` if **all** elements in the list are truthy, `False` otherwise.

`all[T: Boolable & KeyElement, //](set: Set[T]) -> Bool`

Checks if **all** elements in the set are truthy.

**Parameters:**

* T (`Boolable & KeyElement`): The type of elements to check.

**Args:**

* set (`Set[T]`): The set to check.
**Returns:** `True` if **all** elements in the set are truthy, `False` otherwise. `all(value: SIMD[dtype, size]) -> Bool` Checks if **all** elements in the simd vector are truthy. **Args:** * ​value (`SIMD[dtype, size]`): The simd vector to check. **Returns:** `True` if **all** elements in the simd vector are truthy, `False` otherwise. --- ## all_true `all_true(src: NDBuffer[type, 1, origin]) -> Bool` Returns True if all the elements in a buffer are True and False otherwise. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** True if all of the elements of the buffer are True and False otherwise. --- ## allgather `allgather[type: DType, rank: Int, ngpus: Int, //](input_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], ctxs: List[DeviceContext])` Performs all-gather across GPUs. **Parameters:** * ​type (`DType`): DType - The data type of tensor elements. * ​rank (`Int`): Int - Number of dimensions in input tensors. * ​ngpus (`Int`): Int - Number of GPUs participating in all-gather. **Args:** * ​input\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Input buffers from each GPU. * ​output\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Output buffers for each GPU. * ​ctxs (`List[DeviceContext]`): List of device contexts for participating GPUs. --- ## allgather Multi-GPU allgather implementation that gathers values from multiple GPUs into an output buffer. ## Functions * [​`allgather`](/mojo/stdlib/gpu/comm/allgather/allgather): Performs all-gather across GPUs. --- ## allreduce `allreduce[type: DType, rank: Int, ngpus: Int, outputs_lambda: fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None](input_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], output_buffers: InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus], rank_sigs: InlineArray[UnsafePointer[Signal], 8], ctxs: List[DeviceContext], _max_num_blocks: Optional[Int] = Optional(None))` Performs an allreduce operation across multiple GPUs. This function serves as the main entry point for performing allreduce operations across multiple GPUs. It automatically selects between two implementations: * A peer-to-peer (P2P) based implementation when P2P access is possible between GPUs * A naive implementation as fallback when P2P access is not available The allreduce operation combines values from all GPUs using element-wise addition and distributes the result back to all GPUs. Note: * Input and output buffers must have identical shapes across all GPUs. * The number of elements must be identical across all input/output buffers. * Performance is typically better with P2P access enabled between GPUs. **Parameters:** * ​type (`DType`): The data type of the tensor elements (e.g. DType.float32). * ​rank (`Int`): The number of dimensions in the input/output tensors. * ​ngpus (`Int`): The number of GPUs participating in the allreduce. * ​outputs\_lambda (`fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None`): An output elementwise lambda. **Args:** * ​input\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Array of input tensors from each GPU, one per GPU. * ​output\_buffers (`InlineArray[NDBuffer[type, rank, MutableAnyOrigin], ngpus]`): Array of output tensors for each GPU to store results. 
* rank\_sigs (`InlineArray[UnsafePointer[Signal], 8]`): Array of Signal pointers used for cross-GPU synchronization.
* ctxs (`List[DeviceContext]`): List of device contexts for each participating GPU.
* \_max\_num\_blocks (`Optional[Int]`): Optional maximum number of blocks used to compute the grid configuration. If not passed, a dispatch table sets the grid configuration.

---

## allreduce

Multi-GPU allreduce implementation for efficient tensor reduction across GPUs.

This module provides an optimized implementation of allreduce operations across multiple GPUs, supporting both peer-to-peer (P2P) and non-P2P communication patterns. The implementation automatically selects between two approaches based on hardware capabilities:

1. P2P-based implementation (when P2P access is available):
   * Uses direct GPU-to-GPU memory access for better performance
   * Implements both single-stage and two-stage algorithms:
     * Single-stage for latency-bound transfers (small tensors)
     * Two-stage (reduce-scatter + all-gather) for bandwidth-bound transfers (large tensors)
   * Optimized for NVLink bandwidth utilization
   * Uses vectorized memory access and higher precision accumulation

2. Non-P2P fallback implementation:
   * Copies data through host memory when direct GPU access isn't possible
   * Simple but functional approach for systems without P2P support

The implementation is tuned for common GPU architectures (A100, H100) and includes parameters that can be adjusted for different hardware configurations.

Limitations:

* Number of elements must be a multiple of SIMD width
* Maximum of 8 GPUs supported
* All input/output buffers must have identical shapes

## Aliases

### `elementwise_epilogue_type`

`alias elementwise_epilogue_type = fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None`

### `MAX_GPUS`

`alias MAX_GPUS = 8`

Maximum number of GPUs supported in the allreduce implementation. This constant sets the upper bound for the number of GPUs supported in this algorithm.

### `MAX_NUM_BLOCKS_UPPER_BOUND`

`alias MAX_NUM_BLOCKS_UPPER_BOUND = 512`

Maximum number of thread blocks to use for reduction kernels. This value has been empirically optimized through grid search across different GPU architectures. While this value is optimal for A100 GPUs, H100 GPUs may benefit from more blocks to fully saturate NVLink bandwidth.

## Structs

* [`Signal`](/mojo/stdlib/gpu/comm/allreduce/Signal): A synchronization primitive for coordinating GPU thread blocks across multiple devices.

## Functions

* [`allreduce`](/mojo/stdlib/gpu/comm/allreduce/allreduce): Performs an allreduce operation across multiple GPUs.
* [`can_enable_p2p`](/mojo/stdlib/gpu/comm/allreduce/can_enable_p2p): If peer-to-peer access is supported, enables it between all GPU pairs.

---

## AMDScheduleBarrierMask

`@register_passable(trivial)`

`struct AMDScheduleBarrierMask`

Represents different instruction scheduling masks for AMDGPU scheduling instructions. These masks control which types of instructions can be reordered across a barrier for performance optimization. When used with `schedule_barrier()`, the mask determines which instructions the compiler is allowed to move across the barrier point.

## Implemented traits

`AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility`

## Aliases

### `ALL_ALU`

`alias ALL_ALU = AMDScheduleBarrierMask(1)`

Allows reordering of all arithmetic and logic instructions that don't involve memory operations.
### `ALL_DS` `alias ALL_DS = AMDScheduleBarrierMask(128)` Permits reordering of all Local Data Share (LDS) operations. ### `ALL_VMEM` `alias ALL_VMEM = AMDScheduleBarrierMask(16)` Enables reordering of all vector memory operations (reads and writes). ### `DS_READ` `alias DS_READ = AMDScheduleBarrierMask(256)` Enables reordering of LDS read operations only. ### `DS_WRITE` `alias DS_WRITE = AMDScheduleBarrierMask(512)` Enables reordering of LDS write operations only. ### `MFMA` `alias MFMA = AMDScheduleBarrierMask(8)` Allows reordering of matrix multiplication and WMMA instructions. ### `NONE` `alias NONE = AMDScheduleBarrierMask(0)` No instructions can cross the barrier. Most restrictive option. ### `SALU` `alias SALU = AMDScheduleBarrierMask(4)` Permits reordering of scalar arithmetic/logic unit instructions only. ### `TRANS` `alias TRANS = AMDScheduleBarrierMask(1024)` Allows reordering of transcendental instructions (sin, cos, exp, etc). ### `VALU` `alias VALU = AMDScheduleBarrierMask(2)` Permits reordering of vector arithmetic/logic unit instructions only. ### `VMEM_READ` `alias VMEM_READ = AMDScheduleBarrierMask(32)` Allows reordering of vector memory read operations only. ### `VMEM_WRITE` `alias VMEM_WRITE = AMDScheduleBarrierMask(64)` Allows reordering of vector memory write operations only. ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` Initializes an `AMDScheduleBarrierMask` from an integer value. This implicit constructor allows creating a barrier mask directly from an integer, which is useful for combining multiple mask flags using bitwise operations. **Args:** * ​value (`Int`): The integer value to use for the barrier mask. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two `AMDScheduleBarrierMask` instances for equality. **Args:** * ​other (`Self`): The other `AMDScheduleBarrierMask` to compare with. **Returns:** True if the masks have the same value, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two `AMDScheduleBarrierMask` instances for inequality. **Args:** * ​other (`Self`): The other `AMDScheduleBarrierMask` to compare with. **Returns:** True if the masks have different values, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the `AMDScheduleBarrierMask`. Converts the mask to a human-readable string based on its value. **Returns:** A string representation of the mask, or aborts if the value is invalid. ### `__int__` `__int__(self) -> Int` Converts the `AMDScheduleBarrierMask` to an integer. **Returns:** The integer value of the mask, which can be used with low-level APIs. --- ## AMDSchedulerTuning `@register_passable(trivial)` `struct AMDSchedulerTuning` ## Fields * ​block\_shape (`IndexList[2]`): * ​tuning\_values (`IndexList[3]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` --- ## AndMask `@register_passable(trivial)` `struct AndMask[T: MHAMask, S: MHAMask, //, lhs: T, rhs: S]` Mask that's the AND of two masks. 
## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = get_vtable_entry(:trait T, "apply_log2e_after_mask") if get_vtable_entry(:trait T, "apply_log2e_after_mask") else get_vtable_entry(:trait S, "apply_log2e_after_mask")` ### `mask_out_of_bound` `alias mask_out_of_bound = get_vtable_entry(:trait T, "mask_out_of_bound") if get_vtable_entry(:trait T, "mask_out_of_bound") else get_vtable_entry(:trait S, "mask_out_of_bound")` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = get_vtable_entry(:trait S, "mask_safe_out_of_bounds") if get_vtable_entry(:trait T, "mask_safe_out_of_bounds") else get_vtable_entry(:trait T, "mask_safe_out_of_bounds")` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## any `any[T: Boolable & Copyable & Movable, //](list: List[T, hint_trivial_type]) -> Bool` Checks if **any** element in the list is truthy. **Parameters:** * T (`Boolable & Copyable & Movable`): The type of elements to check. **Args:** * list (`List[T, hint_trivial_type]`): The list to check. **Returns:** `True` if **any** element in the list is truthy, `False` otherwise. `any[T: Boolable & KeyElement, //](set: Set[T]) -> Bool` Checks if **any** element in the set is truthy. **Parameters:** * T (`Boolable & KeyElement`): The type of elements to check. **Args:** * set (`Set[T]`): The set to check. **Returns:** `True` if **any** element in the set is truthy, `False` otherwise. `any(value: SIMD[dtype, size]) -> Bool` Checks if **any** element in the SIMD vector is truthy. **Args:** * value (`SIMD[dtype, size]`): The SIMD vector to check. **Returns:** `True` if **any** element in the SIMD vector is truthy, `False` otherwise. --- ## any_true `any_true(src: NDBuffer[type, 1, origin]) -> Bool` Returns `True` if any of the elements in the buffer are `True`, and `False` otherwise. **Args:** * src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** `True` if any of the elements of the buffer are `True`, `False` otherwise. --- ## anytype Defines the core traits for object lifetime management in Mojo. This module provides the foundational traits that define how objects are created, managed and destroyed in Mojo: * `UnknownDestructibility`: The most basic trait that all types extend by default. Types with this trait have no destructor and no lifetime management. * `AnyType`: The base trait for types that require lifetime management through destructors. Any type that needs cleanup when it goes out of scope should implement this trait. * `ImplicitlyDestructible`: An alias for `AnyType` to help with the transition to linear types. Use this when you want to be explicit about a type having a destructor. These traits are built into Mojo and do not need to be imported. ## Aliases ### `ImplicitlyDestructible` `alias ImplicitlyDestructible = AnyType` ## Traits * [`AnyType`](/mojo/stdlib/builtin/anytype/AnyType): A trait for types that require lifetime management through destructors. * [`UnknownDestructibility`](/mojo/stdlib/builtin/anytype/UnknownDestructibility): The most basic trait that all Mojo types extend by default.
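As a quick illustration of the `any` and `all` builtins documented above, here is a minimal sketch applied to a SIMD boolean mask (it assumes both functions are available without imports, since they live in the `builtin` package):

```mojo
def main():
    var mask = SIMD[DType.bool, 4](True, False, True, True)

    # `all` is True only when every lane is truthy.
    print(all(mask))  # False: lane 1 is False

    # `any` is True when at least one lane is truthy.
    print(any(mask))  # True
```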
--- ## AnyType A trait for types that require lifetime management through destructors. The `AnyType` trait is fundamental to Mojo's memory management system. It indicates that a type has a destructor that needs to be called when instances go out of scope. This is essential for types that own resources like memory, file handles, or other system resources that need proper cleanup. Key aspects: * Any type with a destructor must implement this trait * The destructor (`__del__`) is called automatically when an instance's lifetime ends * A type composed of fields with destructors automatically gets a destructor * All Mojo structs and traits inherit from `AnyType` by default unless they specify `@explicit_destroy` Example: ```mojo struct ResourceOwner(AnyType): var ptr: UnsafePointer[Int] fn __init__(out self, size: Int): self.ptr = UnsafePointer[Int].alloc(size) fn __del__(owned self): # Clean up owned resources self.ptr.free() ``` Best practices: * Implement this trait when your type owns resources that need cleanup * Ensure the destructor properly frees all owned resources * Consider using `@explicit_destroy` for types that should never have destructors * Use composition to automatically handle nested resource cleanup ## Implemented traits `UnknownDestructibility` ## Methods ### `__del__` `__del__(owned self: _Self, /)` Destroys the instance and cleans up any owned resources. This method is called automatically when an instance's lifetime ends. It receives an owned value and should perform all necessary cleanup operations like: * Freeing allocated memory * Closing file handles * Releasing system resources * Cleaning up any other owned resources The instance is considered dead after this method completes, regardless of whether any explicit cleanup was performed. --- ## API references * [Python](/max/api/python): The Python library API reference. * [Mojo](/mojo/lib): The Mojo library API reference. * [REST](/max/api/serve): The MAX serving REST API reference. --- ## append_shape `append_shape[rank: Int](in_shape: IndexList[rank], last2nd: Int, last: Int) -> IndexList[(rank + 2)]` Appends to the input shape by inserting `last2nd` and `last` at the end. --- ## apple_accelerate ## Aliases ### `APPLE_ACCELERATE` `alias APPLE_ACCELERATE = _Global[__init__[__mlir_type.!kgen.string]("APPLE_ACCELERATE"), _OwnedDLHandle, _init_dylib]` ### `cblas_gemm_type` `alias cblas_gemm_type = fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None` ### `LIB_ACC_PATH` `alias LIB_ACC_PATH = "/System/Library/Frameworks/Accelerate.framework/Accelerate"` ## Functions * [`apple_batched_matmul`](./apple_batched_matmul): * [`apple_gemv`](./apple_gemv): * [`apple_matmul`](./apple_matmul): * [`get_cblas_f32_function`](./get_cblas_f32_function): * [`use_apple_accelerate_lib`](./use_apple_accelerate_lib): --- ## apple_amx_intrinsics ## Functions * [`dot_at_b`](./dot_at_b): * [`dot_at_b_impl`](./dot_at_b_impl): * [`extrx`](./extrx): Extracts a row or moves it to x, result in amx0. * [`extry`](./extry): Extracts a row or moves it to y, result in amx0.
* [`fma`](./fma): * [`fma16`](./fma16): Float16 matrix multiply and add. * [`fma32`](./fma32): Float32 matrix multiply and add. * [`fma64`](./fma64): Float64 matrix multiply and add. * [`fms16`](./fms16): Float16 matrix multiply and subtract. * [`fsm32`](./fsm32): Float32 matrix multiply and subtract. * [`fsm64`](./fsm64): Float64 matrix multiply and subtract. * [`genlut`](./genlut): * [`ldx`](./ldx): * [`ldy`](./ldy): * [`ldz`](./ldz): * [`ldzi`](./ldzi): * [`load_z`](./load_z): * [`mac16`](./mac16): SI16 matrix multiply and add. * [`matfp`](./matfp): Float16 matrix multiply. * [`max_int__`](./max_int__): UI16 matrix multiply. * [`read_x`](./read_x): * [`read_y`](./read_y): * [`store_x`](./store_x): * [`store_y`](./store_y): * [`store_z`](./store_z): * [`stx`](./stx): * [`sty`](./sty): * [`stz`](./stz): * [`stzi`](./stzi): * [`transpose_z_to_x_or_y`](./transpose_z_to_x_or_y): * [`vec_int__`](./vec_int__): Horizontal ui16 multiply-accumulate `z0[i] += x0[i] * y0[i]`. * [`vecfp`](./vecfp): Horizontal float16 multiply-accumulate `z0[i] += x0[i] * y0[i]`. --- ## apple_batched_matmul `apple_batched_matmul[*, transpose_b: Bool = False, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` --- ## apple_gemv `apple_gemv[*, b_packed: Bool, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape])` --- ## apple_matmul `apple_matmul[*, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](cblas_gemm_fn: fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` `apple_matmul[*, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape,
strides, alignment=alignment, address_space=address_space, exclusive=exclusive])` --- ## apply `apply[: origin.set, //, func: fn(Int) capturing -> Int](t: IntTuple[origin]) -> IntTuple` Apply a function to each integer value in an `IntTuple`. This function recursively applies the given function to each integer value in a potentially nested `IntTuple` structure, preserving the structure. **Parameters:** * ​func (`fn(Int) capturing -> Int`): Function to apply to each integer value. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to transform. **Returns:** A new `IntTuple` with the same structure but with each integer value transformed by the function. --- ## apply_epilogue `apply_epilogue[elementwise_lambda: fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None, dst_layout: Layout, dst_element_layout: Layout = __init__[::Origin[::Bool(IntTuple(1), IntTuple(1))](src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], offset: Int)` --- ## apply_penalties_to_logits `apply_penalties_to_logits[logit_type: DType, penalty_type: DType, //, target: StringSlice[StaticConstantOrigin]](logits: LayoutTensor[logit_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], compressed_frequency_data: LayoutTensor[int32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_offsets: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_penalty: SIMD[penalty_type, 1], presence_penalty: SIMD[penalty_type, 1], repetition_penalty: SIMD[penalty_type, 1], ctx: DeviceContextPtr)` Apply penalties to the logits based on the frequency of the tokens in the batch. The frequency data is stored in a CSR format, where the frequency\_offsets is the starting index of each sequence in the frequency\_data array. The frequency\_data array is a 2D array, where: * frequency\_data\[i, 0] is the token id * frequency\_data\[i, 1] is the frequency of the token in the sequence --- ## apply_predicate `apply_predicate[predicate: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool](a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Apply a predicate function recursively to two `IntTuple`s. This function traverses two `IntTuple`s with the same structure and applies a predicate function to corresponding elements. The predicate is applied only to the leaf nodes (integer values). Note: If the structures of the two `IntTuple`s don't match (different nesting or length), the function returns False without applying the predicate. **Parameters:** * ​predicate (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool`): A function that takes two `IntTuple`s (containing integer values) and returns a boolean result. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to compare. * ​b (`IntTuple[origin]`): Second `IntTuple` to compare. **Returns:** True if the predicate returns True for all corresponding elements and the structures match, False otherwise. 
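To make `apply` concrete, here is a minimal sketch that doubles every leaf of a nested `IntTuple` while preserving its structure (it assumes `apply` and `IntTuple` are importable from `layout.int_tuple`, which this reference suggests but does not state outright):

```mojo
from layout.int_tuple import IntTuple, apply

def main():
    var t = IntTuple(2, IntTuple(3, 4))

    @parameter
    fn double(x: Int) -> Int:
        return x * 2

    # Structure is preserved: (2, (3, 4)) -> (4, (6, 8))
    print(apply[double](t))
```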
--- ## apply_q `apply_q[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], X: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Applies the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` to the `X` matrix. See `qr_factorization` for more details on the construction of the Householder reflector. --- ## apply_tiler `apply_tiler[func: fn(Layout, Layout) -> Layout](layout_a: Layout, tiler: List[Layout]) -> Layout` Applies a layout transformation function to each element of a layout with a tiler. This utility function applies the specified transformation function to each corresponding pair of elements from the layout and tiler list. It's a generic mechanism for implementing various tiling operations. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import apply_tiler, logical_divide # Apply logical_divide to each element of a layout with a tiler var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2), IntTuple(1, 2))) var result = apply_tiler[logical_divide](base, tilers) ``` **Parameters:** * func (`fn(Layout, Layout) -> Layout`): A function that takes two layouts and returns a transformed layout. **Args:** * layout\_a (`Layout`): The base layout to transform. * tiler (`List[Layout]`): A list of layouts to use in the transformation. **Returns:** A new layout resulting from applying the transformation function to each pair. --- ## apply_zip `apply_zip[func: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin]) -> IntTuple` Apply a function to pairs of elements from two `IntTuple`s. This function zips two `IntTuple`s together and applies the given function to each pair of elements, creating a new `IntTuple` with the results. **Parameters:** * func (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> IntTuple`): Function that takes two `IntTuple`s and returns an `IntTuple`. **Args:** * t1 (`IntTuple[origin]`): First `IntTuple`. * t2 (`IntTuple[origin]`): Second `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each pair. `apply_zip[: origin.set, //, func: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) capturing -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin]) -> IntTuple` Apply a capturing function to pairs of elements from two `IntTuple`s. This overload allows the function to capture variables from its environment. **Parameters:** * func (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) capturing -> IntTuple`): Capturing function that takes two `IntTuple`s and returns an `IntTuple`. **Args:** * t1 (`IntTuple[origin]`): First `IntTuple`. * t2 (`IntTuple[origin]`): Second `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each pair.
`apply_zip[func: fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin], t3: IntTuple[origin]) -> IntTuple` Apply a function to triplets of elements from three `IntTuple`s. This function zips three `IntTuple`s together and applies the given function to each triplet of elements, creating a new `IntTuple` with the results. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) -> IntTuple`): Function that takes three `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. * ​t3 (`IntTuple[origin]`): Third `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each triplet. `apply_zip[: origin.set, //, func: fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) capturing -> IntTuple](t1: IntTuple[origin], t2: IntTuple[origin], t3: IntTuple[origin]) -> IntTuple` Apply a capturing function to triplets of elements from three `IntTuple`s. This overload allows the function to capture variables from its environment. **Parameters:** * ​func (`fn[ImmutableOrigin, ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1], IntTuple[$2]) capturing -> IntTuple`): Capturing function that takes three `IntTuple`s and returns an `IntTuple`. **Args:** * ​t1 (`IntTuple[origin]`): First `IntTuple`. * ​t2 (`IntTuple[origin]`): Second `IntTuple`. * ​t3 (`IntTuple[origin]`): Third `IntTuple`. **Returns:** A new `IntTuple` containing the results of applying func to each triplet. --- ## arange `arange[type: DType, simd_width: Int](start: SIMD[type, 1], stop: SIMD[type, 1], step: SIMD[type, 1], index: IndexList[1]) -> SIMD[type, simd_width]` --- ## arange ## Functions * [​`arange`](./arange): * [​`arange_shape`](./arange_shape): --- ## arange_shape `arange_shape[type: DType, single_thread_blocking_override: Bool](start: SIMD[type, 1], stop: SIMD[type, 1], step: SIMD[type, 1]) -> IndexList[1]` --- ## arc Reference-counted smart pointers. You can import these APIs from the `memory` package. For example: ```mojo from memory import ArcPointer ``` ## Structs * [​`ArcPointer`](/mojo/stdlib/memory/arc/ArcPointer): Atomic reference-counted pointer. --- ## architectures ## `register_all_models()` {#max.pipelines.architectures.register_all_models} > max.pipelines.architectures.register\_all\_models() Imports model architectures, thus registering the architecture in the shared `PipelineRegistry`. --- ## ArcPointer `@register_passable` `struct ArcPointer[T: Movable]` Atomic reference-counted pointer. This smart pointer owns an instance of `T` indirectly managed on the heap. This pointer is copyable, including across threads, maintaining a reference count to the underlying data. When you initialize an `ArcPointer` with a value, it allocates memory and moves the value into the allocated memory. Copying an instance of an `ArcPointer` increments the reference count. Destroying an instance decrements the reference count. When the reference count reaches zero, `ArcPointer` destroys the value and frees its memory. This pointer itself is thread-safe using atomic accesses to reference count the underlying data, but references returned to the underlying data are not thread-safe. Subscripting an `ArcPointer` (`ptr[]`) returns a mutable reference to the stored value. This is the only safe way to access the stored value. 
Other methods, such as using the `unsafe_ptr()` method to retrieve an unsafe pointer to the stored value, or accessing the private fields of an `ArcPointer`, are unsafe and may result in memory errors. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual. Examples: ```mojo from memory import ArcPointer var p = ArcPointer(4) var p2 = p p2[] = 3 print(3 == p[]) ``` ## Parameters * T (`Movable`): The type of the stored value. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Identifiable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(owned value: T) -> Self` Construct a new thread-safe, reference-counted smart pointer, and move the value into heap memory managed by the new pointer. **Args:** * value (`T`): The value to manage. ### `__copyinit__` `__copyinit__(existing: Self) -> Self` Copy an existing reference. Increment the refcount to the object. **Args:** * existing (`Self`): The existing reference. ### `__del__` `__del__(owned self)` Delete the smart pointer. Decrement the reference count for the stored value. If there are no more references, delete the object and free its memory. ### `__getitem__` `__getitem__[self_life: ImmutableOrigin](ref [self_life] self) -> ref [self_life] T` Returns a mutable reference to the managed value. **Parameters:** * self\_life (`ImmutableOrigin`): The origin of self. **Returns:** A reference to the managed value. ### `__is__` `__is__(self, rhs: Self) -> Bool` Returns True if the two `ArcPointer` instances point at the same object. **Args:** * rhs (`Self`): The other `ArcPointer`. **Returns:** True if the two `ArcPointer` instances point at the same object and False otherwise. ### `__isnot__` `__isnot__(self, rhs: Self) -> Bool` Returns True if the two `ArcPointer` instances point at different objects. **Args:** * rhs (`Self`): The other `ArcPointer`. **Returns:** True if the two `ArcPointer` instances point at different objects and False otherwise. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[T]` Retrieves a pointer to the underlying memory. **Returns:** The `UnsafePointer` to the pointee. ### `count` `count(self) -> SIMD[uint64, 1]` Counts the number of current references. **Returns:** The current number of references to the pointee. --- ## arg Implements functions and variables for interacting with the execution and system environment. You can import these APIs from the `sys` package. For example: ```mojo from sys import argv def main(): arguments = argv() print( arguments[0], #app.mojo arguments[1] #Hello world! ) for arg in arguments: print(arg) # If the program is app.mojo: # mojo run app.mojo "Hello world!" ``` ## Functions * [`argv`](/mojo/stdlib/sys/arg/argv): The list of command line arguments. --- ## arg_nonzero `arg_nonzero[type: DType, output_type: DType, rank: Int](input_buffer: NDBuffer[type, rank, origin], output_buffer: NDBuffer[output_type, 2, origin])` Gathers the indices of all non-zero elements in the input buffer, storing them in `output_buffer`. **Parameters:** * type (`DType`): The element type. * output\_type (`DType`): The integer type to store the indices in. * rank (`Int`): The rank of the tensor. **Args:** * input\_buffer (`NDBuffer[type, rank, origin]`): The tensor to count the non-zeros in. * output\_buffer (`NDBuffer[output_type, 2, origin]`): The indices of all non-zero elements.
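Circling back to `ArcPointer`: a minimal sketch of the documented `count()` and copy behavior, showing the reference count grow as copies are made:

```mojo
from memory import ArcPointer

def main():
    var p = ArcPointer(42)
    print(p.count())  # 1: a single reference

    var p2 = p        # __copyinit__ increments the refcount
    print(p.count())  # 2

    print(p is p2)    # True: both point at the same object
```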
--- ## arg_nonzero ## Functions * [`arg_nonzero`](./arg_nonzero): Gathers the indices of all non-zero elements in the input buffer, storing them in `output_buffer`. * [`arg_nonzero_shape`](./arg_nonzero_shape): Return \[NumNonZeros, InputRank], where NumNonZeros is the number of non-zero elements in the input. --- ## arg_nonzero_shape `arg_nonzero_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input_buffer: NDBuffer[type, rank, origin]) -> IndexList[2]` Return \[NumNonZeros, InputRank], where NumNonZeros is the number of non-zero elements in the input. **Parameters:** * type (`DType`): The element type. * rank (`Int`): The rank. * single\_thread\_blocking\_override (`Bool`): This op can block. **Args:** * input\_buffer (`NDBuffer[type, rank, origin]`): The tensor to count the non-zeros in. **Returns:** Shape of the arg\_nonzero kernel for this input \[NumNonZeros, InputRank]. --- ## argmax `argmax(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis: Int, output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the maximum element along the specified axis. **Args:** * input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * axis (`Int`): The axis. * output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. `argmax(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis_buf: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the maximum element along the specified axis. **Args:** * input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * axis\_buf (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The axis tensor. * output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor.
--- ## argmax_gpu `argmax_gpu[type: DType, output_type: DType, rank: Int](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` --- ## argmaxmin ## Functions * [`argmax`](./argmax): Finds the indices of the maximum element along the specified axis. * [`argmin`](./argmin): Finds the indices of the minimum element along the specified axis. --- ## argmaxmin_gpu `argmaxmin_gpu[type: DType, output_type: DType, rank: Int, largest: Bool](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` Wraps the Top-K GPU kernel with K=1 to perform argmax or argmin on the inner-most dimension. **Parameters:** * type (`DType`): The data type of the input tensor. * output\_type (`DType`): The data type of the output tensor. * rank (`Int`): The rank of the input tensor. * largest (`Bool`): Whether to perform argmax (`True`) or argmin (`False`). --- ## argmaxmin_gpu ## Functions * [`argmax_gpu`](./argmax_gpu): * [`argmaxmin_gpu`](./argmaxmin_gpu): Wraps the Top-K GPU kernel with K=1 to perform argmax or argmin on the inner-most dimension. * [`argmin_gpu`](./argmin_gpu): --- ## argmin `argmin(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis: Int, output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the minimum element along the specified axis. **Args:** * input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * axis (`Int`): The axis. * output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. `argmin(input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], axis_buf: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Finds the indices of the minimum element along the specified axis. **Args:** * input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * axis\_buf (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The axis tensor.
* output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. --- ## argmin_gpu `argmin_gpu[type: DType, output_type: DType, rank: Int](ctx: DeviceContext, input: NDBuffer[type, rank, origin], output: NDBuffer[output_type, rank, origin])` --- ## args_to_tuple `args_to_tuple[swap: Bool](arg_0: Int, arg_1: Int) -> Tuple[Int, Int]` --- ## argsort `argsort[*, ascending: Bool = True, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContext)` Performs argsort on input buffer, storing indices in output buffer. **Parameters:** * ascending (`Bool`): Sort direction (True for ascending, False for descending). * target (`StringSlice[StaticConstantOrigin]`): Target device ("cpu" or "gpu"). **Args:** * output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer to store sorted indices. * input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer containing values to sort. * ctx (`DeviceContext`): Device context for execution. `argsort[ascending: Bool = True](output: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` CPU-only version of argsort. **Parameters:** * ascending (`Bool`): Sort direction (True for ascending, False for descending). **Args:** * output (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer to store sorted indices. * input (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Buffer containing values to sort. --- ## argsort ## Functions * [`argsort`](./argsort): Performs argsort on input buffer, storing indices in output buffer. --- ## argv `argv() -> VariadicList[StringSlice[StaticConstantOrigin]]` The list of command line arguments. **Returns:** The list of command line arguments provided when mojo was invoked. --- ## ascii `ascii(value: StringSlice[origin]) -> String` Get the ASCII representation of the object. **Args:** * value (`StringSlice[origin]`): The object to get the ASCII representation of. **Returns:** A string containing the ASCII representation of the object.
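A minimal sketch of `ascii` (assuming it is available as a builtin, since the entry above lists no import; the escape format for non-ASCII characters follows Python's `ascii()` conventions, so treat the expected outputs in the comments as approximate):

```mojo
def main():
    print(ascii("hello"))  # 'hello'
    print(ascii("café"))   # non-ASCII characters are escaped, e.g. 'caf\xe9'
```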
--- ## asin `asin[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `asin` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. * width (`Int`): The width of the input and output SIMD vector. **Args:** * x (`SIMD[dtype, width]`): The input argument. **Returns:** The `asin` of the input. --- ## asinh `asinh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `asinh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. * width (`Int`): The width of the input and output SIMD vector. **Args:** * x (`SIMD[dtype, width]`): The input argument. **Returns:** The `asinh` of the input. --- ## assert_almost_equal `assert_almost_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, atol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0E-8), rtol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0000000000000001E-5), equal_nan: Bool = False, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal up to a tolerance. If they are not, an Error is raised. When the type is boolean or integral, exact equality is checked. When the type is floating-point, this checks that the two input values are numerically close, i.e. that $|lhs - rhs|$ falls within the given absolute (`atol`) and relative (`rtol`) tolerances. **Parameters:** * dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * lhs (`SIMD[dtype, size]`): The lhs of the equality. * rhs (`SIMD[dtype, size]`): The rhs of the equality. * msg (`String`): The message to print. * atol (`SIMD[float64, 1]`): The absolute tolerance. * rtol (`SIMD[float64, 1]`): The relative tolerance. * equal\_nan (`Bool`): Whether to treat nans as equal. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_equal `assert_equal[T: EqualityComparable & Stringable, //](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If they are not, an Error is raised. **Parameters:** * T (`EqualityComparable & Stringable`): The type of the input values. **Args:** * lhs (`T`): The lhs of the equality. * rhs (`T`): The rhs of the equality. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal(lhs: String, rhs: String, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If they are not, an Error is raised. **Args:** * lhs (`String`): The lhs of the equality. * rhs (`String`): The rhs of the equality. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise.
`assert_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If they are not, an Error is raised. **Parameters:** * dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * lhs (`SIMD[dtype, size]`): The lhs of the equality. * rhs (`SIMD[dtype, size]`): The rhs of the equality. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[T: Copyable & Movable & EqualityComparable & Representable, //](lhs: List[T], rhs: List[T], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * T (`Copyable & Movable & EqualityComparable & Representable`): The type of the elements in the lists. **Args:** * lhs (`List[T]`): The left-hand side list. * rhs (`List[T]`): The right-hand side list. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[O1: ImmutableOrigin, O2: ImmutableOrigin](lhs: List[StringSlice[O1]], rhs: List[StringSlice[O2]], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * O1 (`ImmutableOrigin`): The origin of lhs. * O2 (`ImmutableOrigin`): The origin of rhs. **Args:** * lhs (`List[StringSlice[O1]]`): The left-hand side list. * rhs (`List[StringSlice[O2]]`): The right-hand side list. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal[D: DType](lhs: List[SIMD[D, 1]], rhs: List[SIMD[D, 1]], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are equal. **Parameters:** * D (`DType`): A DType. **Args:** * lhs (`List[SIMD[D, 1]]`): The left-hand side list. * rhs (`List[SIMD[D, 1]]`): The right-hand side list. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_equal(lhs: PythonObject, rhs: PythonObject, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are equal. If they are not, an Error is raised. **Args:** * lhs (`PythonObject`): The lhs of the equality. * rhs (`PythonObject`): The rhs of the equality. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails.
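A minimal usage sketch of `assert_equal`, exercising the `Int`, `String`, and `List` overloads documented above (the message is printed only when an assertion fails):

```mojo
from testing import assert_equal

def main():
    assert_equal(1 + 1, 2)
    assert_equal("mojo", "mojo", msg="strings should match")
    assert_equal(List[Int](1, 2, 3), List[Int](1, 2, 3))
```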
--- ## assert_false `assert_false[T: Boolable, //](val: T, msg: String = __init__[__mlir_type.!kgen.string]("condition was unexpectedly True"), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input value is False and raises an Error if it's not. **Parameters:** * T (`Boolable`): The type of the value argument. **Args:** * val (`T`): The value to assert to be False. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_is `assert_is[T: Stringable & Identifiable](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values have the same identity. If they do not, an Error is raised. **Parameters:** * T (`Stringable & Identifiable`): A Stringable and Identifiable type. **Args:** * lhs (`T`): The lhs of the `is` statement. * rhs (`T`): The rhs of the `is` statement. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_is_not `assert_is_not[T: Stringable & Identifiable](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values have different identities. If they do not, an Error is raised. **Parameters:** * T (`Stringable & Identifiable`): A Stringable and Identifiable type. **Args:** * lhs (`T`): The lhs of the `is not` statement. * rhs (`T`): The rhs of the `is not` statement. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_not_equal `assert_not_equal[T: EqualityComparable & Stringable, //](lhs: T, rhs: T, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal, an Error is raised. **Parameters:** * T (`EqualityComparable & Stringable`): The type of the input values. **Args:** * lhs (`T`): The lhs of the inequality. * rhs (`T`): The rhs of the inequality. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_not_equal(lhs: String, rhs: String, msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal, an Error is raised. **Args:** * lhs (`String`): The lhs of the inequality. * rhs (`String`): The rhs of the inequality. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise.
`assert_not_equal[dtype: DType, size: Int](lhs: SIMD[dtype, size], rhs: SIMD[dtype, size], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input values are not equal. If they are equal, an Error is raised. **Parameters:** * dtype (`DType`): The dtype of the left- and right-hand-side SIMD vectors. * size (`Int`): The width of the left- and right-hand-side SIMD vectors. **Args:** * lhs (`SIMD[dtype, size]`): The lhs of the inequality. * rhs (`SIMD[dtype, size]`): The rhs of the inequality. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. `assert_not_equal[T: Copyable & Movable & EqualityComparable & Representable, //](lhs: List[T], rhs: List[T], msg: String = __init__[__mlir_type.!kgen.string](""), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that two lists are not equal. **Parameters:** * T (`Copyable & Movable & EqualityComparable & Representable`): The type of the elements in the lists. **Args:** * lhs (`List[T]`): The left-hand side list. * rhs (`List[T]`): The right-hand side list. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assert_raises `struct assert_raises` Context manager that asserts that the block raises an exception. You can use this to test expected error cases, and to test that the correct errors are raised. For instance: ```mojo from testing import assert_raises # Good! Caught the raised error, test passes with assert_raises(): raise "SomeError" # Also good! with assert_raises(contains="Some"): raise "SomeError" # This will assert, we didn't raise with assert_raises(): pass # This will let the underlying error propagate, failing the test with assert_raises(contains="Some"): raise "OtherError" ``` ## Fields * message\_contains (`Optional[String]`): If present, check that the error message contains this literal string. * call\_location (`_SourceLocation`): Assigned the value returned by `__call_location()` at `Self.__init__()`. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, location: Optional[_SourceLocation] = Optional(None))` Construct a context manager with no message pattern. **Args:** * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). `__init__(out self, *, contains: String, location: Optional[_SourceLocation] = Optional(None))` Construct a context manager matching specific errors. **Args:** * contains (`String`): The test will only pass if the error message includes the literal text passed. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). ### `__enter__` `__enter__(self)` Enter the context manager. ### `__exit__` `__exit__(self)` Exit the context manager with no error. **Raises:** AssertionError: Always. The block must raise to pass the test. `__exit__(self, error: Error) -> Bool` Exit the context manager with an error. **Args:** * error (`Error`): The error raised. **Returns:** True if the error message contained the expected string.
**Raises:** Error: If the error raised doesn't include the expected string. --- ## assert_true `assert_true[T: Boolable, //](val: T, msg: String = __init__[__mlir_type.!kgen.string]("condition was unexpectedly False"), *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the input value is True and raises an Error if it's not. **Parameters:** * T (`Boolable`): The type of the value argument. **Args:** * val (`T`): The value to assert to be True. * msg (`String`): The message to be printed if the assertion fails. * location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). **Raises:** An Error with the provided message if assert fails and `None` otherwise. --- ## assume `assume(val: Bool)` Signals to the optimizer that the condition is always true. This allows the optimizer to optimize the code. **Args:** * val (`Bool`): The input value which is assumed to be `True`. --- ## async_copy `async_copy[type: DType, //, size: Int, *, fill: OptionalReg[SIMD[type, 1]] = OptionalReg[SIMD[type, 1]]({:i1 0, 1}), bypass_L1_16B: Bool = True, l2_prefetch: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), eviction_policy: CacheEviction = CacheEviction(0)](src: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)], dst: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)], src_size: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0), predicate: Bool = False)` Asynchronously copies data from global memory to shared memory. This function provides a high-performance asynchronous memory copy operation with configurable caching behavior, prefetching, and fill values. It maps directly to the PTX cp.async instruction on NVIDIA GPUs. **Constraints:** * Fill value is only supported for partial copies, i.e. when `src_size` is smaller than `size`. **Parameters:** * type (`DType`): The data type to copy (e.g. float32, int32). * size (`Int`): Number of bytes to copy (must be 4, 8, or 16). * fill (`OptionalReg[SIMD[type, 1]]`): Optional fill value for uncopied bytes when `src_size` is smaller than `size`. * bypass\_L1\_16B (`Bool`): If True, bypasses L1 cache for 16-byte copies. * l2\_prefetch (`OptionalReg[Int]`): Optional L2 prefetch size (64, 128, or 256 bytes). * eviction\_policy (`CacheEviction`): Cache eviction policy for the copy operation. **Args:** * src (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Source pointer in global memory. * dst (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)]`): Destination pointer in shared memory. * src\_size (`SIMD[int32, 1]`): Actual bytes to copy from src (remaining bytes use fill value). * predicate (`Bool`): Optional predicate to conditionally execute the copy. --- ## async_copy_arrive `async_copy_arrive[type: AnyType, address_space: AddressSpace](address: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin])` Makes a memory barrier track all prior async copy operations from this thread. This function ensures that all previously initiated asynchronous copy operations from the executing thread are tracked by the memory barrier at the specified location. Only supported on NVIDIA GPUs. **Parameters:** * type (`AnyType`): The data type stored at the barrier location. * address\_space (`AddressSpace`): The memory address space where the barrier is located. **Args:** * address (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory barrier object location.
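Returning to the assertion helpers documented earlier on this page, a minimal sketch of `assert_true` and `assert_false` (both raise on failure, so running this program succeeds silently apart from the final print):

```mojo
from testing import assert_true, assert_false

def main():
    assert_true(1 < 2, msg="expected 1 < 2 to hold")
    assert_false(1 > 2)
    print("all assertions passed")
```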
--- ## async_copy_commit_group `async_copy_commit_group()` Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group. This function creates a new cp.async-group containing all previously initiated but uncommitted asynchronous copy operations. The group can then be waited on using async\_copy\_wait\_group(). Notes: * Only supported on NVIDIA GPUs * Maps to the cp.async.commit.group PTX instruction * Used for managing asynchronous memory transfers * Should be paired with async\_copy\_wait\_group() or async\_copy\_wait\_all() --- ## async_copy_wait_all `async_copy_wait_all()` Waits for completion of all committed cp.async-groups. This function blocks execution until all previously committed cp.async-groups have completed their memory transfers. It provides a barrier to ensure all asynchronous copies are finished. Notes: * Only supported on NVIDIA GPUs. * Maps to the cp.async.wait.all PTX instruction. * Ensures all outstanding asynchronous transfers are complete. * More coarse-grained than `async_copy_wait_group()`. --- ## async_copy_wait_group `async_copy_wait_group(n: SIMD[int32, 1])` Waits for the completion of `n` most recently committed cp.async-groups. This function blocks execution until the specified number of previously committed cp.async-groups have completed their memory transfers. Notes: * Only supported on NVIDIA GPUs. * Maps to the cp.async.wait.group PTX instruction. * Provides fine-grained control over asynchronous transfer synchronization. * Can be used to implement a pipeline of asynchronous transfers. **Args:** * ​n (`SIMD[int32, 1]`): The number of pending cp.async-groups to wait for. Must be > 0. --- ## asyncrt This module implements the low level concurrency library. ## Structs * [​`DeviceContextPtr`](/mojo/stdlib/runtime/asyncrt/DeviceContextPtr): Exposes a pointer to a C++ DeviceContext to Mojo. * [​`DeviceContextPtrList`](/mojo/stdlib/runtime/asyncrt/DeviceContextPtrList): A fixed-size collection of `DeviceContextPtr` objects. * [​`Task`](/mojo/stdlib/runtime/asyncrt/Task): Represents an asynchronous task that will produce a value of the specified type. * [​`TaskGroup`](/mojo/stdlib/runtime/asyncrt/TaskGroup): A group of tasks that can be executed concurrently. * [​`TaskGroupContext`](/mojo/stdlib/runtime/asyncrt/TaskGroupContext): Context structure for task group operations. ## Functions * [​`create_task`](/mojo/stdlib/runtime/asyncrt/create_task): Run the coroutine as a task on the AsyncRT Runtime. * [​`parallelism_level`](/mojo/stdlib/runtime/asyncrt/parallelism_level): Gets the parallelism level of the Runtime. --- ## atan `atan[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `atan` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `atan` of the input. --- ## atan2 `atan2[dtype: DType, width: Int, //](y: SIMD[dtype, width], x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `atan2` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​y (`SIMD[dtype, width]`): The first input argument. * ​x (`SIMD[dtype, width]`): The second input argument. 
**Returns:** The `atan2` of the inputs.

---

## atanh

`atanh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the `atanh` of the inputs.

**Constraints:** The input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): The input argument.

**Returns:** The `atanh` of the input.

---

## atof

`atof(str_slice: StringSlice[origin]) -> SIMD[float64, 1]`

Parses the given string as a floating point and returns that value. For example, `atof("2.25")` returns `2.25`. This function is in the prelude, so you don't need to import it.

**Args:**

* str\_slice (`StringSlice[origin]`): A string to be parsed as a floating point.

**Returns:** A floating-point value that represents the string.

**Raises:** If the given string cannot be parsed as a floating-point value, for example in `atof("hi")`.

---

## atol

`atol(str_slice: StringSlice[origin], base: Int = 10) -> Int`

Parses and returns the given string as an integer in the given base. If base is set to 0, the string is parsed as an integer literal, with the following considerations:

* '0b' or '0B' prefix indicates binary (base 2)
* '0o' or '0O' prefix indicates octal (base 8)
* '0x' or '0X' prefix indicates hexadecimal (base 16)
* Without a prefix, it's treated as decimal (base 10)

This follows [Python's integer literals format](https://docs.python.org/3/reference/lexical_analysis.html#integers). This function is in the prelude, so you don't need to import it.

Examples:

```text
>>> atol("32")
32
>>> atol("FF", 16)
255
>>> atol("0xFF", 0)
255
>>> atol("0b1010", 0)
10
```

**Args:**

* str\_slice (`StringSlice[origin]`): A string to be parsed as an integer in the given base.
* base (`Int`): Base used for conversion, value must be between 2 and 36, or 0.

**Returns:** An integer value that represents the string.

**Raises:** If the given string cannot be parsed as an integer value or if an incorrect base is provided.

---

## atomic

Implements the `Atomic` struct. You can import these APIs from the `os` package. For example:

```mojo
from os import Atomic
```

## Structs

* [`Atomic`](/mojo/stdlib/os/atomic/Atomic): Represents a value with atomic operations.
* [`Consistency`](/mojo/stdlib/os/atomic/Consistency): Represents the consistency model for atomic operations.

---

## Atomic

`struct Atomic[dtype: DType, *, scope: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")]`

Represents a value with atomic operations. The struct provides atomic `add` and `sub` methods for mutating the value.

## Parameters

* dtype (`DType`): DType of the value.
* scope (`StringSlice[StaticConstantOrigin]`): The memory synchronization scope.

## Fields

* value (`SIMD[dtype, 1]`): The atomic value. This is the underlying value of the atomic. Access to the value can only occur through atomic primitive operations.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit`

`__init__(out self, value: SIMD[dtype, 1])`

Constructs a new atomic value.

**Args:**

* value (`SIMD[dtype, 1]`): Initial value represented as `Scalar[dtype]` type.

### `__iadd__`

`__iadd__(mut self, rhs: SIMD[dtype, 1])`

Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation.
Memory is affected according to sequentially consistent ordering.

**Args:**

* rhs (`SIMD[dtype, 1]`): Value to add.

### `__isub__`

`__isub__(mut self, rhs: SIMD[dtype, 1])`

Performs atomic in-place sub. Atomically replaces the current value with the result of arithmetic subtraction of the value and arg. That is, it performs atomic post-decrement. The operation is a read-modify-write operation. Memory is affected according to sequentially consistent ordering.

**Args:**

* rhs (`SIMD[dtype, 1]`): Value to subtract.

### `load`

`load(mut self) -> SIMD[dtype, 1]`

Loads the current value from the atomic.

**Returns:** The current value of the atomic.

### `fetch_add`

`static fetch_add[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]`

Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation. Memory is affected according to the value of `ordering`, which defaults to sequentially consistent.

**Parameters:**

* ordering (`Consistency`): The memory ordering.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The source pointer.
* rhs (`SIMD[dtype, 1]`): Value to add.

**Returns:** The original value before addition.

`fetch_add[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](mut self, rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]`

Performs atomic in-place add. Atomically replaces the current value with the result of arithmetic addition of the value and arg. That is, it performs atomic post-increment. The operation is a read-modify-write operation. Memory is affected according to the value of `ordering`, which defaults to sequentially consistent.

**Parameters:**

* ordering (`Consistency`): The memory ordering.

**Args:**

* rhs (`SIMD[dtype, 1]`): Value to add.

**Returns:** The original value before addition.

### `store`

`static store[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], value: SIMD[dtype, 1])`

Performs an atomic store. Memory is affected according to the value of `ordering`, which defaults to sequentially consistent.

**Parameters:**

* ordering (`Consistency`): The memory ordering.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The source pointer.
* value (`SIMD[dtype, 1]`): The value to store.

### `fetch_sub`

`fetch_sub[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](mut self, rhs: SIMD[dtype, 1]) -> SIMD[dtype, 1]`

Performs atomic in-place sub. Atomically replaces the current value with the result of arithmetic subtraction of the value and arg. That is, it performs atomic post-decrement. The operation is a read-modify-write operation. Memory is affected according to the value of `ordering`, which defaults to sequentially consistent.

**Parameters:**

* ordering (`Consistency`): The memory ordering.

**Args:**

* rhs (`SIMD[dtype, 1]`): Value to subtract.

**Returns:** The original value before subtraction.
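Before the remaining methods, here's a minimal usage sketch tying together the constructor and the read-modify-write methods above (an illustration, not part of the reference):

```mojo
from os import Atomic


def main():
    # Implicit constructor from a scalar value.
    var counter = Atomic[DType.int64](0)
    counter += 1                       # __iadd__: atomic in-place add
    counter -= 1                       # __isub__: atomic in-place sub
    var before = counter.fetch_add(5)  # returns the value *before* the add
    print(before)                      # 0
    print(counter.load())              # 5
```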
### `compare_exchange_weak`

`compare_exchange_weak[*, failure_ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6)), success_ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, mut expected: SIMD[dtype, 1], desired: SIMD[dtype, 1]) -> Bool`

Atomically compares the self value with that of the expected value. If the values are equal, then the self value is replaced with the desired value and True is returned. Otherwise, False is returned and the expected value is rewritten with the self value.

**Parameters:**

* failure\_ordering (`Consistency`): The memory ordering for the failure case.
* success\_ordering (`Consistency`): The memory ordering for the success case.

**Args:**

* expected (`SIMD[dtype, 1]`): The expected value.
* desired (`SIMD[dtype, 1]`): The desired value.

**Returns:** True if self == expected and False otherwise.

### `max`

`static max[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], rhs: SIMD[dtype, 1])`

Performs atomic in-place max on the pointer. Atomically replaces the current value pointed to by `ptr` with the result of max of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics.

**Constraints:** The input type must be an integral or floating-point type.

**Parameters:**

* ordering (`Consistency`): The memory ordering.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The source pointer.
* rhs (`SIMD[dtype, 1]`): Value to max.

`max[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, rhs: SIMD[dtype, 1])`

Performs atomic in-place max. Atomically replaces the current value with the result of max of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics.

**Constraints:** The input type must be an integral or floating-point type.

**Parameters:**

* ordering (`Consistency`): The memory ordering.

**Args:**

* rhs (`SIMD[dtype, 1]`): Value to max.

### `min`

`static min[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], rhs: SIMD[dtype, 1])`

Performs atomic in-place min on the pointer. Atomically replaces the current value pointed to by `ptr` with the result of min of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics.

**Constraints:** The input type must be an integral or floating-point type.

**Parameters:**

* ordering (`Consistency`): The memory ordering.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The source pointer.
* rhs (`SIMD[dtype, 1]`): Value to min.

`min[*, ordering: Consistency = Consistency(__init__[__mlir_type.!pop.int_literal](6))](self, rhs: SIMD[dtype, 1])`

Performs atomic in-place min. Atomically replaces the current value with the result of min of the value and arg. The operation is a read-modify-write operation performed according to sequential consistency semantics.
**Constraints:** The input type must be an integral or floating-point type.

**Parameters:**

* ordering (`Consistency`): The memory ordering.

**Args:**

* rhs (`SIMD[dtype, 1]`): Value to min.

---

## attention

A vanilla opaque KV Cache optimized attention mechanism.

## `Attention` {#max.nn.attention.attention.Attention}

> *class* max.nn.attention.attention.Attention(n\_heads: 'int', kv\_params: 'KVCacheParams', wqkv: 'TensorValue', wo: 'LinearV1', scale: 'float')

**Parameters:**

* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **wqkv** ([`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) )
* **wo** ([`LinearV1`](../linear.md#max.nn.linear.LinearV1) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )

## `AttentionQKV` {#max.nn.attention.attention.AttentionQKV}

> *class* max.nn.attention.attention.AttentionQKV(n\_heads: 'int', kv\_params: 'KVCacheParams', wq: 'TensorValueLike', wk: 'TensorValueLike', wv: 'TensorValueLike', wo: 'LinearV1', scale: 'float')

**Parameters:**

* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **wq** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **wk** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **wv** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **wo** ([`LinearV1`](../linear.md#max.nn.linear.LinearV1) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )

---

## attention

## Modules

* [`attention`](/max/api/python/nn/attention/attention)
* [`attention_with_rope`](/max/api/python/nn/attention/attention_with_rope)
* [`ragged_attention`](/max/api/python/nn/attention/ragged_attention)
* [`interfaces`](/max/api/python/nn/attention/interfaces)

---

## Attention

A mechanism used in AI models such as [transformers](transformer.mdx) that enables the model to selectively focus on different parts of the input sequence when making predictions. Unlike traditional model architectures that process all input data with equal importance, models with attention assign different importance levels to different tokens (such as words or pixels). This allows the model to better understand the complete meaning of the input, especially when an accurate meaning depends on relationships between tokens that are far apart (such as between words that occur far apart in a sentence).

Attention is crucial for large language models (LLMs) so they can capture long-range dependencies and contextual relationships in the given text. It allows LLMs to handle complex and nuanced language, enabling them to generate coherent and contextually relevant output even when the input text includes nuanced references to other parts of the text.

Attention was introduced and refined in the papers [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) (Bahdanau et al., 2014) and [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) (Luong et al., 2015).

The most well-known form of attention is [self-attention](self-attention.mdx), in which each token gets its own attention score for every other token (each token "attends to" all other tokens), in order to determine the relative importance of each other token in that context.

---

## Attention mask

An attention mask is a mechanism used in the [attention](attention.mdx) layers of a [transformer](transformer.mdx) model to indicate which tokens the model should ignore when computing attention scores.

For example, attention masks can prevent the model from attending to [padding tokens](padding-tokens.mdx), which are added to make sequences in a batch the same length and thus offer no information for attention.

Another common mask is a "causal mask" (or "look-ahead mask"), which prevents the [self-attention](self-attention.mdx) layer from looking at future tokens when predicting a new token, ensuring that it attends only to previous tokens in the sequence. Although it sounds absurd that it would even try to look at future tokens (because it's generating tokens one at a time, in order), self-attention is designed for more general-purpose attention scoring. In its most basic form, self-attention is agnostic to token order—it looks at all tokens in the sequence equally, based on their embeddings, and calculates scores by looking both backward and ahead in the sequence. (For example, self-attention is used during [context encoding](context-encoding.mdx) to establish an understanding of the input text.) So instead of creating a different kind of attention mechanism for autoregressive inference, the causal mask instructs the self-attention layer to simply ignore all future tokens and only look backward when generating scores that help predict the next token.

---

## attention_with_rope

An opaque KV Cache optimized attention mechanism with Rope.
## `AttentionWithRope` {#max.nn.attention.attention_with_rope.AttentionWithRope}

> *class* max.nn.attention.attention\_with\_rope.AttentionWithRope(\*, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, linear\_cls=Linear, stacked\_qkv=False, scale=None, has\_bias=False, float8\_config=None, clip\_qkv=None)

Implementation of attention that uses the rope frequency.

Initializes the attention layer.

**Parameters:**

* **rope** ([`OptimizedRotaryEmbedding`](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding) ) – The rope layer to borrow the freq\_cis value from.
* **num\_attention\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of attention heads.
* **num\_key\_value\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of key/value heads.
* **hidden\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimension of the hidden states.
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) – KV Cache Params, including the number of kv heads, the head dim, and data type.
* **dtype** ([`DType`](../../dtype.md#max.dtype.DType) ) – DType of the QKV and output projection weights.
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` `|` `None` ) – Device to place the weights and run the computation. If multiple are provided, the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
* **linear\_cls** (`Callable` `[` `...` `,` [`Linear`](../linear.md#max.nn.linear.Linear) `]` ) – Linear class to use for the outputs dense layer.
* **stacked\_qkv** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether the weights are stacked together.
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – Value used to scale the results of the attention output.
* **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether to use an attention bias.
* **clip\_qkv** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – If provided, the QKV weights are clamped between \[-clip\_qkv, clip\_qkv]
* **float8\_config** ([`Float8Config`](../linear.md#max.nn.linear.Float8Config) `|` `None` )

### `qkv_input_scale` {#max.nn.attention.attention_with_rope.AttentionWithRope.qkv_input_scale}

> *property* qkv\_input\_scale\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)\*

The max of q, k, and v scale input vectors.

### `qkv_weight_scale` {#max.nn.attention.attention_with_rope.AttentionWithRope.qkv_weight_scale}

> *property* qkv\_weight\_scale\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)\*

The max of q, k, and v scale weight vectors.

### `rope` {#max.nn.attention.attention_with_rope.AttentionWithRope.rope}

> rope\*: [OptimizedRotaryEmbedding](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding)\*

### `wqkv` {#max.nn.attention.attention_with_rope.AttentionWithRope.wqkv}

> *property* wqkv\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)\*

The concatenation of q, k, and v weight vectors.
### `wqkv_bias` {#max.nn.attention.attention_with_rope.AttentionWithRope.wqkv_bias}

> *property* wqkv\_bias\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)\*

The concatenation of q, k, and v bias weight vectors.

## `AttentionWithRopeQKV` {#max.nn.attention.attention_with_rope.AttentionWithRopeQKV}

> *class* max.nn.attention.attention\_with\_rope.AttentionWithRopeQKV(n\_heads: 'int', kv\_params: 'KVCacheParams', wq: 'TensorValueLike', wk: 'TensorValueLike', wv: 'TensorValueLike', wo: 'LinearV1', scale: 'float', rope: 'OptimizedRotaryEmbedding')

**Parameters:**

* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **wq** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **wk** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **wv** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **wo** ([`LinearV1`](../linear.md#max.nn.linear.LinearV1) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )
* **rope** ([`OptimizedRotaryEmbedding`](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding) )

### `rope` {#max.nn.attention.attention_with_rope.AttentionWithRopeQKV.rope}

> rope\*: [OptimizedRotaryEmbedding](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding)\*

## `AttentionWithRopeV1` {#max.nn.attention.attention_with_rope.AttentionWithRopeV1}

> *class* max.nn.attention.attention\_with\_rope.AttentionWithRopeV1(n\_heads, kv\_params, wqkv, wo, scale, rope, bias=None, perm\_idx=None, quantization\_config=None)

Implementation of attention that uses the rope frequency.

Deprecated: Use AttentionWithRope instead.
**Parameters:**

* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **wqkv** ([`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) )
* **wo** ([`LinearV1`](../linear.md#max.nn.linear.LinearV1) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )
* **rope** ([`OptimizedRotaryEmbedding`](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding) )
* **bias** ([`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` `None` )
* **perm\_idx** ([`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` `None` )
* **quantization\_config** ([`QuantizationConfig`](../../graph/quantization.md#max.graph.quantization.QuantizationConfig) `|` `None` )

### `bias` {#max.nn.attention.attention_with_rope.AttentionWithRopeV1.bias}

> bias\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

### `perm_idx` {#max.nn.attention.attention_with_rope.AttentionWithRopeV1.perm_idx}

> perm\_idx\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

### `quantization_config` {#max.nn.attention.attention_with_rope.AttentionWithRopeV1.quantization_config}

> quantization\_config\*: [QuantizationConfig](../../graph/quantization.md#max.graph.quantization.QuantizationConfig) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

### `rope` {#max.nn.attention.attention_with_rope.AttentionWithRopeV1.rope}

> rope\*: [OptimizedRotaryEmbedding](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding)\*

## `DistributedAttentionWithRope` {#max.nn.attention.attention_with_rope.DistributedAttentionWithRope}

> *class* max.nn.attention.attention\_with\_rope.DistributedAttentionWithRope(\*\*kwargs)

Initializes the attention layer.

**Parameters:**

* **rope** – The rope layer to borrow the freq\_cis value from.
* **num\_attention\_heads** – The number of attention heads.
* **num\_key\_value\_heads** – Number of key/value heads.
* **hidden\_size** – The dimension of the hidden states.
* **kv\_params** – KV Cache Params, including the number of kv heads, the head dim, and data type.
* **dtype** – DType of the QKV and output projection weights.
* **devices** – Device to place the weights and run the computation. If multiple are provided, the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
* **linear\_cls** – Linear class to use for the outputs dense layer.
* **stacked\_qkv** – Whether the weights are stacked together.
* **scale** – Value used to scale the results of the attention output.
* **has\_bias** – Whether to use an attention bias.
* **clip\_qkv** – If provided, the QKV weights are clamped between \[-clip\_qkv, clip\_qkv]

## `GGUFQAttentionWithRope` {#max.nn.attention.attention_with_rope.GGUFQAttentionWithRope}

> *class* max.nn.attention.attention\_with\_rope.GGUFQAttentionWithRope(\*, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, dtype, quantization\_encoding, devices=None, linear\_cls=Linear, scale=None, has\_bias=False, clip\_qkv=None)

Implementation of attention with GGUF quantized weights.

Initializes the attention layer.
**Parameters:**

* **rope** ([`OptimizedRotaryEmbedding`](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding) ) – The rope layer to borrow the freq\_cis value from.
* **num\_attention\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of attention heads.
* **num\_key\_value\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of key/value heads.
* **hidden\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimension of the hidden states.
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) – KV Cache Params, including the number of kv heads, the head dim, and data type.
* **layer\_idx** – The layer number associated with this Attention block.
* **dtype** ([`DType`](../../dtype.md#max.dtype.DType) ) – DType of the weights, should always be uint8.
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` `|` `None` ) – Device to place the weights and run the computation. If multiple are provided, the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
* **quantization\_encoding** ([`QuantizationEncoding`](../../graph/quantization.md#max.graph.quantization.QuantizationEncoding) ) – Quantization encoding of the weights.
* **linear\_cls** (`Callable` `[` `...` `,` [`Linear`](../linear.md#max.nn.linear.Linear) `]` ) – Linear class to use for the outputs dense layer.
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – Value used to scale the results of the attention output.
* **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether to use an attention bias.
* **clip\_qkv** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – If provided, the QKV weights are clamped between \[-clip\_qkv, clip\_qkv]

### `rope` {#max.nn.attention.attention_with_rope.GGUFQAttentionWithRope.rope}

> rope\*: [OptimizedRotaryEmbedding](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding)\*

### `wqkv` {#max.nn.attention.attention_with_rope.GGUFQAttentionWithRope.wqkv}

> *property* wqkv\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)\*

The concatenation of q, k, and v weight vectors.

### `wqkv_bias` {#max.nn.attention.attention_with_rope.GGUFQAttentionWithRope.wqkv_bias}

> *property* wqkv\_bias\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)\*

The concatenation of q, k, and v bias weight vectors.

## `GPTQAttentionWithRope` {#max.nn.attention.attention_with_rope.GPTQAttentionWithRope}

> *class* max.nn.attention.attention\_with\_rope.GPTQAttentionWithRope(quantization\_config, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, scale=None, linear\_cls=Linear)

Implementation of the GPT-Q attention layer.

Initializes the attention layer.

**Parameters:**

* **rope** ([`OptimizedRotaryEmbedding`](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding) ) – The rope layer to borrow the freq\_cis value from.
* **num\_attention\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of attention heads.
* **num\_key\_value\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of key/value heads.
* **hidden\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimension of the hidden states.
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) – KV Cache Params, including the number of kv heads, the head dim, and data type.
* **dtype** ([`DType`](../../dtype.md#max.dtype.DType) ) – DType of the QKV and output projection weights.
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` `|` `None` ) – Device to place the weights and run the computation. If multiple are provided, the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
* **linear\_cls** (`Callable` `[` `...` `,` [`Linear`](../linear.md#max.nn.linear.Linear) `]` ) – Linear class to use for the outputs dense layer.
* **stacked\_qkv** – Whether the weights are stacked together.
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – Value used to scale the results of the attention output.
* **has\_bias** – Whether to use an attention bias.
* **clip\_qkv** – If provided, the QKV weights are clamped between \[-clip\_qkv, clip\_qkv]
* **quantization\_config** ([`QuantizationConfig`](../../graph/quantization.md#max.graph.quantization.QuantizationConfig) )

### `wqkv` {#max.nn.attention.attention_with_rope.GPTQAttentionWithRope.wqkv}

> *property* wqkv\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)\*

The concatenation of q, k, and v weight vectors.

## `LatentAttentionWithRope` {#max.nn.attention.attention_with_rope.LatentAttentionWithRope}

> *class* max.nn.attention.attention\_with\_rope.LatentAttentionWithRope(\*, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, dtype, devices=None, linear\_cls=Linear, scale=None, has\_bias=False, clip\_qkv=None, q\_lora\_rank=None, kv\_lora\_rank=512, qk\_nope\_head\_dim=128, qk\_rope\_head\_dim=64, v\_head\_dim=128, buffer\_size=16384)

Implementation of Latent Attention with Rope.

Initializes the attention layer.

**Parameters:**

* **rope** ([`OptimizedRotaryEmbedding`](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding) ) – The rope layer to borrow the freq\_cis value from.
* **num\_attention\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of attention heads.
* **num\_key\_value\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of key/value heads.
* **hidden\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimension of the hidden states.
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) – KV Cache Params, including the number of kv heads, the head dim, and data type.
* **layer\_idx** – The layer number associated with this Attention block.
* **dtype** ([`DType`](../../dtype.md#max.dtype.DType) ) – DType of the weights, should always be uint8.
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` `|` `None` ) – Device to place the weights and run the computation. If multiple are provided, the first device is used. Use DistributedAttentionWithRope to use all devices during attention computation.
* **quantization\_encoding** – Quantization encoding of the weights.
* **linear\_cls** (`Callable` `[` `...` `,` [`Linear`](../linear.md#max.nn.linear.Linear) `]` ) – Linear class to use for the outputs dense layer.
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – Value used to scale the results of the attention output.
* **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether to use an attention bias.
* **clip\_qkv** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – If provided, the QKV weights are clamped between \[-clip\_qkv, clip\_qkv]
* **buffer\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Buffer size for storing the temporary results during prefill, in units of tokens.
* **q\_lora\_rank** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )
* **kv\_lora\_rank** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **qk\_nope\_head\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **qk\_rope\_head\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **v\_head\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )

### `rope` {#max.nn.attention.attention_with_rope.LatentAttentionWithRope.rope}

> rope\*: [OptimizedRotaryEmbedding](../rotary_embedding.md#max.nn.rotary_embedding.OptimizedRotaryEmbedding)\*

### `w_uk_uv` {#max.nn.attention.attention_with_rope.LatentAttentionWithRope.w_uk_uv}

> *property* w\_uk\_uv\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)]\*

The concatenation of q, k, and v weight vectors.

### `wqkv` {#max.nn.attention.attention_with_rope.LatentAttentionWithRope.wqkv}

> *property* wqkv\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)\*

The concatenation of q, k, and v weight vectors.

### `wqkv_bias` {#max.nn.attention.attention_with_rope.LatentAttentionWithRope.wqkv_bias}

> *property* wqkv\_bias\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)\*

The concatenation of q, k, and v bias weight vectors.

## `distribute_value()` {#max.nn.attention.attention_with_rope.distribute_value}

> max.nn.attention.attention\_with\_rope.distribute\_value(v, devices)

**Parameters:**

* **v** ([`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) )
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` )

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*TensorValue*](../../graph/TensorValue.md#max.graph.TensorValue)]

---

## Attribute

`@register_passable(trivial)`

`struct Attribute`

Represents GPU kernel function attributes.

This struct defines constants for various function attributes that can be queried or set for GPU kernels. These attributes provide information about resource requirements and execution constraints of kernel functions.

## Fields

* code (`SIMD[int32, 1]`): The numeric code representing the attribute type.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable`

## Aliases

### `BINARY_VERSION`

`alias BINARY_VERSION = Attribute(__init__[__mlir_type.!pop.int_literal](6))`

The binary architecture version for which the function was compiled. This value is the major binary version \* 10 + the minor binary version, so a binary version 1.3 function would return the value 13. Note that this will return a value of 10 for legacy cubins that do not have a properly-encoded binary architecture version.
### `CACHE_MODE_CA`

`alias CACHE_MODE_CA = Attribute(__init__[__mlir_type.!pop.int_literal](7))`

The attribute to indicate whether the function has been compiled with the user-specified option "-Xptxas --dlcm=ca" set.

### `CLUSTER_SCHEDULING_POLICY_PREFERENCE`

`alias CLUSTER_SCHEDULING_POLICY_PREFERENCE = Attribute(__init__[__mlir_type.!pop.int_literal](15))`

The block scheduling policy of a function. The value type is CUclusterSchedulingPolicy / cudaClusterSchedulingPolicy.

### `CLUSTER_SIZE_MUST_BE_SET`

`alias CLUSTER_SIZE_MUST_BE_SET = Attribute(__init__[__mlir_type.!pop.int_literal](10))`

If this attribute is set, the kernel must launch with a valid cluster size specified.

### `CONST_SIZE_BYTES`

`alias CONST_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](2))`

The size in bytes of user-allocated constant memory required by this function.

### `LOCAL_SIZE_BYTES`

`alias LOCAL_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](3))`

The size in bytes of local memory used by each thread of this function.

### `MAX_DYNAMIC_SHARED_SIZE_BYTES`

`alias MAX_DYNAMIC_SHARED_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](8))`

The maximum size in bytes of dynamically-allocated shared memory that can be used by this function. If the user-specified dynamic shared memory size is larger than this value, the launch will fail.

### `MAX_THREADS_PER_BLOCK`

`alias MAX_THREADS_PER_BLOCK = Attribute(__init__[__mlir_type.!pop.int_literal](0))`

The maximum number of threads per block, beyond which a launch of the function would fail. This number depends on both the function and the device on which the function is currently loaded.

### `NON_PORTABLE_CLUSTER_SIZE_ALLOWED`

`alias NON_PORTABLE_CLUSTER_SIZE_ALLOWED = Attribute(__init__[__mlir_type.!pop.int_literal](14))`

Whether the function can be launched with non-portable cluster size. 1 is allowed, 0 is disallowed. A non-portable cluster size may only function on the specific SKUs the program is tested on. The launch might fail if the program is run on a different hardware platform. The CUDA API provides cudaOccupancyMaxActiveClusters to assist with checking whether the desired size can be launched on the current device. A portable cluster size is guaranteed to be functional on all compute capabilities higher than the target compute capability. The portable cluster size for sm\_90 is 8 blocks per cluster.

### `NUM_REGS`

`alias NUM_REGS = Attribute(__init__[__mlir_type.!pop.int_literal](4))`

The number of registers used by each thread of this function.

### `PREFERRED_SHARED_MEMORY_CARVEOUT`

`alias PREFERRED_SHARED_MEMORY_CARVEOUT = Attribute(__init__[__mlir_type.!pop.int_literal](9))`

On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory.

### `PTX_VERSION`

`alias PTX_VERSION = Attribute(__init__[__mlir_type.!pop.int_literal](5))`

The PTX virtual architecture version for which the function was compiled. This value is the major PTX version \* 10 + the minor PTX version, so a PTX version 1.3 function would return the value 13. Note that this may return the undefined value of 0 for cubins compiled prior to CUDA 3.0.

### `REQUIRED_CLUSTER_DEPTH`

`alias REQUIRED_CLUSTER_DEPTH = Attribute(__init__[__mlir_type.!pop.int_literal](13))`

The required cluster depth in blocks. The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.
### `REQUIRED_CLUSTER_HEIGHT`

`alias REQUIRED_CLUSTER_HEIGHT = Attribute(__init__[__mlir_type.!pop.int_literal](12))`

The required cluster height in blocks. The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.

### `REQUIRED_CLUSTER_WIDTH`

`alias REQUIRED_CLUSTER_WIDTH = Attribute(__init__[__mlir_type.!pop.int_literal](11))`

The required cluster width in blocks. The values must either all be 0 or all be positive. The validity of the cluster dimensions is otherwise checked at launch time.

### `SHARED_SIZE_BYTES`

`alias SHARED_SIZE_BYTES = Attribute(__init__[__mlir_type.!pop.int_literal](1))`

The size in bytes of statically-allocated shared memory required by this function. This does not include dynamically-allocated shared memory requested by the user at runtime.

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Checks if two Attribute instances are equal.

**Args:**

* other (`Self`): The Attribute to compare with.

**Returns:** True if both attributes have the same code, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Checks if two Attribute instances are not equal.

**Args:**

* other (`Self`): The Attribute to compare with.

**Returns:** True if the attributes have different codes, False otherwise.

### `__is__`

`__is__(self, other: Self) -> Bool`

Identity comparison operator for Attribute instances.

**Args:**

* other (`Self`): The Attribute to compare with.

**Returns:** True if both attributes are identical (have the same code), False otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

Negative identity comparison operator for Attribute instances.

**Args:**

* other (`Self`): The Attribute to compare with.

**Returns:** True if the attributes are not identical, False otherwise.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes a string representation of the `Attribute` to the provided writer. This method converts the `Attribute` enum value to its corresponding string name and writes it to the provided writer object.

**Parameters:**

* W (`Writer`): The type of writer to use for output. Must implement the Writer trait.

**Args:**

* writer (`W`): A Writer object that will receive the string representation.

---

## Autoregression

Autoregression is a process by which an AI model iteratively predicts future values based on previous values in a sequence, using its own output as input to itself. Because each prediction depends on prior context, the process is sequential, which limits parallelization. Autoregression is a standard procedure in [transformer](transformer.mdx) models such as large language models (LLMs) and other models that perform time-series forecasting.

This autoregressive process explains why AI chatbots like ChatGPT and Gemini stream the output one word at a time—although they sometimes run so fast that they appear to produce more than one word at a time.

---

## avg_pool

`avg_pool[type: DType, int_type: DType, rank: Int = 4, count_boundary: Bool = False](input: NDBuffer[type, rank, origin], filter: NDBuffer[int_type, 1, origin], strides: NDBuffer[int_type, 1, origin], dilations: NDBuffer[int_type, 1, origin], paddings: NDBuffer[int_type, 1, origin], output: NDBuffer[type, rank, origin], ceil_mode: Bool = False)`

Computes the average pool.

**Parameters:**

* count\_boundary (`Bool`): Whether to count the boundary in the average computation.

**Args:**

* input (`NDBuffer[type, rank, origin]`): Batched image input to the pool2d operator.
* filter (`NDBuffer[int_type, 1, origin]`): Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w).
* strides (`NDBuffer[int_type, 1, origin]`): Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w).
* dilations (`NDBuffer[int_type, 1, origin]`): Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w).
* paddings (`NDBuffer[int_type, 1, origin]`): Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after).
* output (`NDBuffer[type, rank, origin]`): Pre-allocated output tensor space.
* ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding.

---

## avg_pool_gpu

`avg_pool_gpu[type: DType, int_type: DType, rank: Int = 4, count_boundary: Bool = False](ctx: DeviceContext, input: NDBuffer[type, rank, origin], filter: NDBuffer[int_type, 1, origin], strides: NDBuffer[int_type, 1, origin], dilations: NDBuffer[int_type, 1, origin], paddings: NDBuffer[int_type, 1, origin], output: NDBuffer[type, rank, origin], ceil_mode: Bool = False)`

Computes the average pool on GPU.

**Parameters:**

* count\_boundary (`Bool`): Whether to count the boundary in the average computation.

**Args:**

* ctx (`DeviceContext`): The DeviceContext to use for GPU execution.
* input (`NDBuffer[type, rank, origin]`): (On device) Batched image input to the pool2d operator.
* filter (`NDBuffer[int_type, 1, origin]`): (On host) Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w).
* strides (`NDBuffer[int_type, 1, origin]`): (On host) Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w).
* dilations (`NDBuffer[int_type, 1, origin]`): (On host) Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w).
* paddings (`NDBuffer[int_type, 1, origin]`): (On host) Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after).
* output (`NDBuffer[type, rank, origin]`): (On device) Pre-allocated output tensor space.
* ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding.

---

## Axis

`@register_passable(trivial)`

`struct Axis`

## Fields

* axis (`Int`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Indexer`, `Intable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit`

`__init__(axis: Int) -> Self`

`__init__(out self, axis: Int, rank: Int)`

### `__int__`

`__int__(self) -> Int`

### `__index__`

`__index__(self) -> index`

Convert to index.

**Returns:** The corresponding \_\_mlir\_type.index value.

---

## b16decode

`b16decode(str: StringSlice[origin]) -> String`

Performs base16 decoding on the input string.

**Args:**

* str (`StringSlice[origin]`): A base16 encoded string.

**Returns:** The decoded string.

---

## b16encode

`b16encode(str: StringSlice[origin]) -> String`

Performs base16 encoding on the input string slice.

**Args:**

* str (`StringSlice[origin]`): The input string slice.

**Returns:** Base16 encoding of the input string.

---

## b64decode

`b64decode[*, validate: Bool = False](str: StringSlice[origin]) -> String`

Performs base64 decoding on the input string.

**Parameters:**

* validate (`Bool`): If true, the function will validate the input string.

**Args:**

* str (`StringSlice[origin]`): A base64 encoded string.

**Returns:** The decoded string.
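A quick round trip using `b64decode` together with the `b64encode` overloads documented in the next section (a minimal sketch; it assumes the usual implicit conversions from string literals and `String` to `StringSlice`):

```mojo
from base64 import b64decode, b64encode


def main():
    # Encode a short string, then decode it back.
    var encoded = b64encode("Mojo")
    print(encoded)             # TW9qbw==
    print(b64decode(encoded))  # Mojo
```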
---

## b64encode

`b64encode(input_bytes: Span[SIMD[uint8, 1], origin], mut result: String)`

Performs base64 encoding on the input bytes.

Notes: This method reserves the necessary capacity. `result` can be a zero-capacity string.

**Args:**

* input\_bytes (`Span[SIMD[uint8, 1], origin]`): The input byte buffer.
* result (`String`): The string in which to store the values.

`b64encode(input_string: StringSlice[origin]) -> String`

Performs base64 encoding on the input string.

**Args:**

* input\_string (`StringSlice[origin]`): The input string buffer.

**Returns:** The ASCII base64 encoded string.

`b64encode(input_bytes: Span[SIMD[uint8, 1], origin]) -> String`

Performs base64 encoding on the input bytes.

**Args:**

* input\_bytes (`Span[SIMD[uint8, 1], origin]`): The input byte buffer.

**Returns:** The ASCII base64 encoded string.

---

## Backend

`@register_passable(trivial)`

`struct Backend`

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`, `Writable`

## Aliases

### `AUTOMATIC`

`alias AUTOMATIC = Backend(0)`

### `CUBLAS`

`alias CUBLAS = Backend(1)`

### `CUBLASLT`

`alias CUBLASLT = Backend(2)`

### `HIPBLASLT`

`alias HIPBLASLT = Backend(4)`

### `ROCBLAS`

`alias ROCBLAS = Backend(3)`

## Methods

### `__init__`

`@implicit`

`__init__(value: Int) -> Self`

### `__eq__`

`__eq__(self, other: Self) -> Bool`

### `__ne__`

`__ne__(self, other: Self) -> Bool`

### `__is__`

`__is__(self, other: Self) -> Bool`

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

### `__int__`

`__int__(self) -> Int`

### `__str__`

`__str__(self) -> String`

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

---

## ballot

`ballot[dtype: DType](value: Bool) -> SIMD[dtype, 1]`

Returns a bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes, and zero in all inactive lanes. For example, ballot(True) returns the EXEC mask.

**Parameters:**

* dtype (`DType`): The DType of the return type.

**Args:**

* value (`Bool`): The value to place across the mask.

**Returns:** A bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes.

---

## barrier

`barrier()`

Performs a synchronization barrier at the block level. This is equivalent to \_\_syncthreads() in CUDA. All threads in a thread block must execute this function before any thread can proceed past the barrier. This ensures memory operations before the barrier are visible to all threads after the barrier.

---

## base64

Provides functions for base64 encoding strings. You can import these APIs from the `base64` package. For example:

```mojo
from base64 import b64encode
```

## Functions

* [`b16decode`](/mojo/stdlib/base64/base64/b16decode): Performs base16 decoding on the input string.
* [`b16encode`](/mojo/stdlib/base64/base64/b16encode): Performs base16 encoding on the input string slice.
* [`b64decode`](/mojo/stdlib/base64/base64/b64decode): Performs base64 decoding on the input string.
* [`b64encode`](/mojo/stdlib/base64/base64/b64encode): Performs base64 encoding on the input string.

---

## base64

Implements the base64 package.

## Modules

* [`base64`](/mojo/stdlib/base64/base64/): Provides functions for base64 encoding strings.

---

## basename

`basename[PathLike: PathLike, //](path: PathLike) -> String`

Returns the tail section of a path.

```mojo
from os.path import basename

basename("a/path/foo.txt")  # returns "foo.txt"
```

**Parameters:**

* PathLike (`PathLike`): The type conforming to the os.PathLike trait.
**Args:**

* path (`PathLike`): The path to retrieve the basename from.

**Returns:** The basename from the path.

---

## Basics of GPU programming with Mojo

If you have any questions or feedback for this content, please post it in the [Modular forum thread here](https://forum.modular.com/t/gpu-programming-manual/755).

This documentation aims to build your GPU programming knowledge from the ground up, starting with the lowest levels of the stack before progressing to higher-level functionality. It's designed for a diverse audience, from experienced GPU developers to programmers new to GPU coding.

Mojo allows you to program NVIDIA GPUs, with direct access to low-level GPU primitives, while sharing types and functions that can also run on CPUs where applicable. If you're experienced with [NVIDIA Compute Unified Device Architecture](https://developer.nvidia.com/cuda-toolkit) (CUDA), what you'll learn here will enable you to expand your reach as we release support for more hardware.

## Introduction to massively parallel programming

We can no longer rely on new generations of CPUs to increase application performance through improved clock speeds. Power demands and heat dissipation limits have stalled that trend, pushing the hardware industry toward increasing the number of physical cores. Modern consumer CPUs now boast 16 cores or more, capable of running in parallel, which forces programmers to rethink how they maximize performance. This shift is especially evident in AI applications, where performance scales remarkably well with additional cores.

NVIDIA's breakthrough came with CUDA, a general programming model that allows developers to target both server and consumer GPUs for any application domain. This vision sparked an AI revolution when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton trained AlexNet on consumer GPUs, significantly outperforming traditional computer vision methods. GPUs pack thousands of cores; the NVIDIA H100 can run 16,896 threads in parallel in a single clock cycle, with over 270,000 threads queued and ready to go. They're also engineered so that the cost of scheduling threads is much lower than on a traditional CPU. Harnessing this hardware requires a new programming mindset.

Mojo represents a chance to rethink GPU programming and make it more approachable. C/C++ is at the core of GPU programming, but we've seen leaps in ergonomics and memory safety from systems programming languages in recent years. Mojo expands on Python's familiar syntax, adds direct access to low-level CPU and GPU intrinsics for systems programming, and introduces ergonomic and safety improvements from modern languages. This course aims to empower programmers with minimal specialized knowledge to build high-performance, GPU-enabled applications. By lowering the barrier to entry, we aim to fuel more breakthroughs and accelerate innovation.

## Setup

:::note
These examples can run on many consumer NVIDIA GeForce GPUs, though they aren't officially supported yet. Make sure you have the latest NVIDIA drivers.
:::

All of these notebook cells are runnable through a VS Code extension.
You can install [Markdown Lab](https://marketplace.visualstudio.com/items?itemName=jackos.mdlab), then clone the repo that contains the markdown that generated this website:

```sh
git clone git@github.com:modular/max
cd max/mojo/docs/manual/gpu
```

And open `basics.mdx` to run the code cells interactively.

If you prefer the traditional approach using a CLI, first install magic if you don't have it:

```bash
curl -ssL https://magic.modular.com | bash
```

Then restart your terminal, create a project, and enter the virtual environment:

```sh
magic init gpu-basics --mojoproject
cd gpu-basics
magic shell # enter virtual environment
```

You can now create a file such as `main.mojo` and put everything except the imports into a `def main`:

```mojo :once
from gpu import thread_idx
from gpu.host import DeviceContext

def main():
    fn printing_kernel():
        print("GPU thread: [", thread_idx.x, thread_idx.y, thread_idx.z, "]")

    var ctx = DeviceContext()
    ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
    ctx.synchronize()
```

Then compile and run the file using `mojo main.mojo`. When you're ready to exit the virtual environment run the command: `exit`.

## Imports

These are all the imports required to run the examples; put this at the top of your file if you're running from `mojo main.mojo`:

```mojo
from gpu import thread_idx, block_idx, warp, barrier
from gpu.host import DeviceContext, DeviceBuffer, HostBuffer
from gpu.memory import AddressSpace
from memory import stack_allocation
from layout import Layout, LayoutTensor
from math import iota
from sys import sizeof
```

## Your first kernel

In the context of GPU programming, a kernel is a program that runs on each thread that you launch:

```mojo
fn printing_kernel():
    print("GPU thread: [", thread_idx.x, thread_idx.y, thread_idx.z, "]")
```

:::note
We're using `fn` here without the `raises` keyword because a kernel function is not allowed to raise an error condition. When you define a Mojo function with `def`, the compiler always assumes that the function *can* raise an error condition. See [Functions](/mojo/manual/functions) for more information.
:::

We can pass this function as a parameter to `enqueue_function()` to compile it for your attached GPU and launch it. First we need to get the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) for your GPU:

```mojo
var ctx = DeviceContext()
```

Now that we have the `DeviceContext`, we can compile and launch the kernel:

```mojo :once
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)

# Wait for the kernel to finish executing before handing back to CPU
ctx.synchronize()
```

```text
GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]
```

:::note
The term `kernel` in this context originated in the 1980s with the introduction of the [Single Program, Multiple Data](https://en.wikipedia.org/wiki/Single_program,_multiple_data) (SPMD) parallel programming technique, which underpins ROCm and CUDA. In this approach, a kernel executes concurrently across distinct elements of large data structures.
:::

## Threads

Because we passed `block_dim=4`, we launched 4 threads on the x dimension; the kernel code we wrote is executed on each thread. The printing can be out of order depending on which thread reaches that `print()` call first.

Now add the y and z dimensions with `block_dim=(2, 2, 2)`:

:::note
For the `grid_dim` and `block_dim` arguments you can use a single value or a tuple.
A single value will launch N blocks/threads on the x dimension, while using a tuple with up to three values will determine the (x, y, z) dimensions.
:::

```mojo :once
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=(2, 2, 2))
ctx.synchronize()
```

```text
GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 0 1 0 ]
GPU thread: [ 1 1 0 ]
GPU thread: [ 0 0 1 ]
GPU thread: [ 1 0 1 ]
GPU thread: [ 0 1 1 ]
GPU thread: [ 1 1 1 ]
```

We're now launching 8 (2x2x2) threads in total.

## Host vs device and enqueue

You'll see the word _host_, which refers to the CPU that schedules work for the _device_; the device is the accelerator, which in this case is a GPU. When you encounter the term `enqueue` in a method or function call, it means that the host is scheduling the operation to execute asynchronously on the device. If your host-side code relies on the outcome of these device-enqueued operations, you need to call `ctx.synchronize()`. For instance, printing from the CPU without first synchronizing might result in out-of-order output:

```mojo :once
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
print("This might print before the GPU has completed its work")
```

```text
This might print before the GPU has completed its work
GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]
```

In the above example we failed to call `synchronize()` before printing on the host. Because the device could be slightly slower to finish its work, you might see its output after the host output. Let's add a `synchronize()` call:

```mojo :once
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=4)
ctx.synchronize()
print("This will print after the GPU has completed its work")
```

```text
GPU thread: [ 0 0 0 ]
GPU thread: [ 1 0 0 ]
GPU thread: [ 2 0 0 ]
GPU thread: [ 3 0 0 ]
This will print after the GPU has completed its work
```

Any methods or functions you `enqueue` to run on the device will run in the order that you enqueued them. It's only when you're doing something from the host that depends on the results of enqueued calls that you have to synchronize. In GPU programming with Mojo, when there's a tradeoff between GPU performance and safety or ergonomics, performance takes priority, aligning with the expectations of kernel engineers. For instance, while we could eliminate the `enqueue` prefix from method calls and force synchronization for each of them, this would come at a performance cost. Take note of these warning blocks, which call out potential safety violations:

:::warning Synchronization
For any methods or functions prefixed with `enqueue`, you must synchronize before running CPU code that is dependent on what you're enqueuing. Enqueueing multiple method or function calls for a single GPU is safe, as they are scheduled to run in the order you call them.
:::

Mojo enhances the safety and ergonomics of C++ GPU programming where it doesn't sacrifice performance. For example, ASAP destruction automatically frees buffer memory on last use of the object, eliminating memory leaks and ensuring memory is released as early as possible. This is an evolution on modern memory management solutions such as C++ RAII, which is scope-based and may hold onto memory (a precious resource in AI applications) for longer than expected.
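To make the ordering guarantee concrete, here's a minimal sketch reusing `printing_kernel` and `ctx` from above: it enqueues the same kernel twice, and both launches run on the device in the order they were enqueued, covered by a single `synchronize()` at the end:

```mojo :once
# Both launches are scheduled asynchronously and execute in enqueue order.
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=2)
ctx.enqueue_function[printing_kernel](grid_dim=1, block_dim=2)

# One synchronize covers everything enqueued so far.
ctx.synchronize()
print("Both kernel launches have completed")
```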
## Blocks

This kernel demonstrates how blocks work:

```mojo :once
fn block_kernel():
    print(
        "block: [",
        block_idx.x,
        block_idx.y,
        block_idx.z,
        "]",
        "thread: [",
        thread_idx.x,
        thread_idx.y,
        thread_idx.z,
        "]"
    )

ctx.enqueue_function[block_kernel](grid_dim=(2, 2), block_dim=2)
ctx.synchronize()
```

```text
block: [ 0 0 0 ] thread: [ 0 0 0 ]
block: [ 0 0 0 ] thread: [ 1 0 0 ]
block: [ 1 0 0 ] thread: [ 0 0 0 ]
block: [ 1 0 0 ] thread: [ 1 0 0 ]
block: [ 1 1 0 ] thread: [ 0 0 0 ]
block: [ 1 1 0 ] thread: [ 1 0 0 ]
block: [ 0 1 0 ] thread: [ 0 0 0 ]
block: [ 0 1 0 ] thread: [ 1 0 0 ]
```

We're still launching 8 threads in total (2x2x2): there are 4 blocks, each with 2 threads. In GPU programming this grouping of blocks and threads is important: each block can have its own fast SRAM (Static Random Access Memory), which allows threads to communicate. The threads within a block can also communicate through registers; we'll cover this concept when we get to warps. For now the important information to internalize is:

- `grid_dim` defines how many blocks are launched.
- `block_dim` defines how many threads are launched in each block.

## Tiles

The x, y, z dimensions of blocks are important for splitting up large jobs into tiles, so each thread can work on its own subset of the problem. Below is a visualization of how a contiguous array of data can be split up into tiles. If we have an array of UInt32 (unsigned 32-bit integer) data like:

```plaintext
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]
```

We could split the work up between threads and blocks. To get started, we're only going to use the x dimension for threads and blocks:

```plaintext
Thread  |  0  1  2  3
-------------------------
block 0 | [  0  1  2  3 ]
block 1 | [  4  5  6  7 ]
block 2 | [  8  9 10 11 ]
block 3 | [ 12 13 14 15 ]
```

If you had a much larger data array you could further split it up into tiles, e.g. a tile with widths [2, 2] at index (0, 0) would be:

```plaintext
[ 0 1 ]
[ 4 5 ]
```

And index (2, 0) would be:

```plaintext
[ 2 3 ]
[ 6 7 ]
```

This is where you'd introduce the y dimension. Later we'll be working on image data, which is a tensor with 3 dimensions: (height, width, color_channels). For now we're going to focus on how blocks and threads interact, splitting up an array into 1 row per block, and 1 value per thread.

## Buffers

First we'll initialize a contiguous array on the GPU:

```mojo
alias dtype = DType.uint32
alias blocks = 4
alias threads = 4
alias elements_in = blocks * threads # one element per thread

var in_buffer = ctx.enqueue_create_buffer[dtype](elements_in)
```

Creating the GPU buffer is allocating _global memory_, which can be accessed from any block and thread inside a GPU kernel. This memory is relatively slow compared to _shared memory_, which is shared between all of the threads in a block; more on that later.

We can't access memory in a GPU address space from the CPU to initialize the values unless we map it to host:

```mojo
with in_buffer.map_to_host() as host_buffer:
    iota(host_buffer.unsafe_ptr(), elements_in)
    print(host_buffer)
```

```text
HostBuffer([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
```

If you're loading or storing values from a buffer allocated on the GPU, mapping to host ensures the values are copied into the CPU address space when the context manager enters (start of the `with` block), and back to the GPU address space when the context manager exits (end of the `with` block). Note that `map_to_host()` will call `synchronize()` before writing the data back to CPU, so you don't have to call it separately.
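Because `map_to_host()` copies data both ways, you can also use it to modify device data from the CPU. Here's a small sketch (using the same `in_buffer` as above, and writing through the raw pointer to match the `iota` example); the change is written back to GPU memory when the `with` block exits:

```mojo
with in_buffer.map_to_host() as host_buffer:
    # Overwrite the first element on the CPU; the updated contents are
    # copied back to the GPU when the context manager exits.
    var ptr = host_buffer.unsafe_ptr()
    ptr[0] = 42
    print(host_buffer)
```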
## Tensor indexing from threads

Now that we have the data set up, we can wrap the data in a [LayoutTensor](/mojo/kernels/layout/layout_tensor/LayoutTensor/) so that we can reason about how to index into the array, allowing each thread to get its corresponding value:

```mojo :clear
alias layout = Layout.row_major(blocks, threads)

var in_tensor = LayoutTensor[dtype, layout](in_buffer)
```

:::note Memory Layout
"Row major" means the values are stored sequentially in memory:

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ]

"Column major" means memory advances down each column first, then moves to the next column. This layout is used in some GPU tiling kernels because it can align with coalesced column accesses:

[ 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 ]
:::

`LayoutTensor` is a view of the data in the buffer; it does not own the underlying memory. It's a powerful abstraction and offers many advanced methods which we'll dive into in later chapters.

We'll create an alias so that we don't have to repeat the type information for each kernel launch:

```mojo :clear
alias InTensor = LayoutTensor[dtype, layout, MutableAnyOrigin]
```

More information on [origins here](/mojo/manual/values/lifetimes).

Initially we'll just print the values to confirm it's indexing as we expect:

```mojo :once
fn print_values_kernel(in_tensor: InTensor):
    var bid = block_idx.x
    var tid = thread_idx.x
    print("block:", bid, "thread:", tid, "val:", in_tensor[bid, tid])

ctx.enqueue_function[print_values_kernel](
    in_tensor, grid_dim=blocks, block_dim=threads,
)
ctx.synchronize()
```

```text
block: 3 thread: 0 val: 12
block: 3 thread: 1 val: 13
block: 3 thread: 2 val: 14
block: 3 thread: 3 val: 15
block: 1 thread: 0 val: 4
block: 1 thread: 1 val: 5
block: 1 thread: 2 val: 6
block: 1 thread: 3 val: 7
block: 2 thread: 0 val: 8
block: 2 thread: 1 val: 9
block: 2 thread: 2 val: 10
block: 2 thread: 3 val: 11
block: 0 thread: 0 val: 0
block: 0 thread: 1 val: 1
block: 0 thread: 2 val: 2
block: 0 thread: 3 val: 3
```

As in the visualization above, each block/thread pair is getting the corresponding value that we expect. You can see `block: 3 thread: 3` has the last value, 15. Try experimenting with different `grid_dim`, `block_dim`, and indexing values to see how the behavior changes.

## Multiply kernel

Now that we've verified we're getting the correct values when indexing, we'll launch a kernel to multiply each value:

```mojo :once
fn multiply_kernel[multiplier: Int](in_tensor: InTensor):
    in_tensor[block_idx.x, thread_idx.x] *= multiplier

ctx.enqueue_function[multiply_kernel[2]](
    in_tensor, grid_dim=blocks, block_dim=threads,
)

# Map to host and print as 2D array
with in_buffer.map_to_host() as host_buffer:
    var host_tensor = LayoutTensor[dtype, layout](host_buffer)
    print(host_tensor)
```

```text
0 2 4 6
8 10 12 14
16 18 20 22
24 26 28 30
```

Congratulations! You've successfully run a kernel that modifies values from your GPU, and printed the result on your CPU. You can see above that each thread multiplied a single value by 2 in parallel on the GPU, and the result was copied back to the CPU.
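The kernels above index with `[block_idx.x, thread_idx.x]` because the layout is two-dimensional (one row per block). For a flat one-dimensional layout, a common pattern is to compute a global index from the block index, block size, and thread index. Here's a minimal sketch of that pattern, assuming `block_dim` is importable from the `gpu` package alongside `thread_idx` and `block_idx`; launch it like the kernels above:

```mojo
from gpu import block_idx, block_dim, thread_idx

fn global_index_kernel():
    # Each thread computes its unique position in the flattened data.
    var gid = block_idx.x * block_dim.x + thread_idx.x
    print("global index:", gid)
```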
## Sum reduce output

We're going to set up a new output buffer that holds one reduced value per block, the sum computed across that block's threads:

```plaintext
Output: [ block[0] block[1] block[2] block[3] ]
```

Set up the output buffer/tensor for the host and device:

```mojo :clear
var out_buffer = ctx.enqueue_create_buffer[dtype](blocks)

# Zero the values on the device as they'll be used to accumulate results
_ = out_buffer.enqueue_fill(0)

alias out_layout = Layout.row_major(blocks)
alias OutTensor = LayoutTensor[dtype, out_layout, MutableAnyOrigin]

var out_tensor = OutTensor(out_buffer)
```

The problem here is that we can't have all the threads summing their values into the same index in the output buffer, as that would introduce race conditions. We're going to introduce new concepts to deal with this.

## Shared memory

This kernel uses shared memory to accumulate values. Shared memory is much faster than global memory because it resides on-chip, closer to the processing cores, reducing latency and increasing bandwidth. It's not an optimal solution for this kind of reduction operation, but it's a good way to introduce shared memory in a simple example. We'll cover better solutions in the next sections.

```mojo :once
fn sum_reduce_kernel(
    in_tensor: InTensor, out_tensor: OutTensor
):
    # This allocates memory to be shared between threads in a block prior to the
    # kernel launching. Each kernel gets a pointer to the allocated memory.
    var shared = stack_allocation[
        threads,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()

    # Place the corresponding value into shared memory
    shared[thread_idx.x] = in_tensor[block_idx.x, thread_idx.x][0]

    # Wait for all the threads to finish loading their values into shared memory
    barrier()

    # If this is the first thread, sum and write the result to global memory
    if thread_idx.x == 0:
        for i in range(threads):
            out_tensor[block_idx.x] += shared[i]

ctx.enqueue_function[sum_reduce_kernel](
    in_tensor,
    out_tensor,
    grid_dim=blocks,
    block_dim=threads,
)

# Copy the data back to the host and print out the buffer
with out_buffer.map_to_host() as host_buffer:
    print(host_buffer)
```

```text
HostBuffer([6, 22, 38, 54])
```

For our first block/tile we summed the values:

```plaintext
sum([ 0 1 2 3 ]) == 6
```

And the reduction resulted in the output having the sum of 6 in the first position. Every tile/block has been reduced to:

```plaintext
[ 6 22 38 54 ]
```

## Sum multiple values from a single thread

We could skip using shared memory altogether by launching a single thread per block. Each thread can load more than a single value: here we'll be launching one thread per block, loading the 4 corresponding values from that block, and summing them together:

```mojo :once
fn simd_reduce_kernel(
    in_tensor: InTensor, out_tensor: OutTensor
):
    # The [4] means it loads 4 sequential values before doing the `reduce_add`
    out_tensor[block_idx.x] = in_tensor.load[4](block_idx.x, 0).reduce_add()

ctx.enqueue_function[simd_reduce_kernel](
    in_tensor,
    out_tensor,
    grid_dim=blocks,
    block_dim=1, # one thread per block
)

# Ensure we have the same result
with out_buffer.map_to_host() as host_buffer:
    print(host_buffer)
```

```text
HostBuffer([6, 22, 38, 54])
```

This is cleaner and faster: instead of 4 threads writing to shared memory, we're using 1 thread per block and summing the values without the intermediate step. However, this can be even faster by launching one thread per value and doing a single instruction in parallel using warps.
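If you want to sanity-check these results on the CPU, the expected sum for each block over its sequential values is easy to compute. A small host-side sketch:

```mojo
def main():
    alias blocks = 4
    alias threads = 4
    # Each block b owns values [b*threads, b*threads + threads), so its
    # expected reduction is the sum of that range: 6, 22, 38, 54.
    for b in range(blocks):
        var expected = 0
        for t in range(threads):
            expected += b * threads + t
        print("block", b, "expected sum:", expected)
```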
## Warps

:::note Warps
Warp-level instructions are an advanced concept; this section is to demonstrate that these low-level primitives are available from Mojo. We'll go into more depth on warps later, so don't worry if it doesn't make sense yet.
:::

A _warp_ is a group of threads (32 on NVIDIA GPUs) within a block. Threads within the same warp can synchronize their execution, and take advantage of [Single Instruction, Multiple Threads](https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads) (SIMT). SIMT (GPU-focused) allows multiple threads to execute the same instruction on different data with independent control flow and thread states, while SIMD (CPU-focused) applies a single instruction to multiple data elements simultaneously with no thread independence.

We have only 4 threads within each block, well under the 32-thread warp size. If this weren't the case, you'd have to do two reductions: one from each warp to shared memory, then another from shared memory to the output buffer or tensor. Here is a simple warp reduction kernel:

```mojo :once
fn warp_reduce_kernel(
    in_tensor: InTensor, out_tensor: OutTensor
):
    var value = in_tensor.load[1](block_idx.x, thread_idx.x)

    # Each thread gets the value from one thread higher, summing them as they go
    value = warp.sum(value)

    # Print each reduction step in the first block
    if block_idx.x == 0:
        print("thread:", thread_idx.x, "value:", value)

    # Thread 0 has the reduced sum of the values from all the other threads
    if thread_idx.x == 0:
        out_tensor[block_idx.x] = value

ctx.enqueue_function[warp_reduce_kernel](
    in_tensor,
    out_tensor,
    grid_dim=blocks,
    block_dim=threads,
)

# Ensure we have the same result
with out_buffer.map_to_host() as host_buffer:
    print(host_buffer)
```

```text
thread: 0 value: 6
thread: 1 value: 6
thread: 2 value: 5
thread: 3 value: 3
HostBuffer([6, 22, 38, 54])
```

You can see in the output that the first block had the values [0 1 2 3] and was reduced from top to bottom (shuffle down), where the sum result of one thread is passed to the next thread down:

| Thread | value | next_value | result |
|--------|-------|------------|--------|
| 3      | 3     | N/A        | 3      |
| 2      | 2     | 3          | 5      |
| 1      | 1     | 5          | 6      |
| 0      | 0     | 6          | 6      |

## Exercise

Now that we've covered some of the core primitives for GPU programming, here's an exercise to solidify your understanding. Feel free to revisit the examples as you work through it the first time, then challenge yourself to write the code independently. Experimenting with the code and observing the results is also a highly valuable way to deepen your skills; don't hesitate to tweak things and see what happens!

1. Create a host buffer for the input of `DType` `Float32`, with 32 elements, and initialize the numbers sequentially. Copy the host buffer to the device.
2. Create an `in_tensor` that wraps the input buffer, with the dimensions (8, 4).
3. Create a host and device buffer for the output of `DType` `Float32`, with 8 elements; don't forget to zero the values (e.g. with `enqueue_fill(0)`).
4. Launch a GPU kernel with 8 blocks and 4 threads that reduce-sums the values, using your preferred method to write to the output buffer.
5. Copy the device buffer to the host buffer, and print it out on the CPU.

One possible answer follows.
```mojo :reset
from gpu import thread_idx, block_idx, warp
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
from math import iota

alias dtype = DType.float32
alias blocks = 8
alias threads = 4
alias elements_in = blocks * threads

# Create context
var ctx = DeviceContext()

# Create buffers
var in_buffer = ctx.enqueue_create_buffer[dtype](elements_in)
var out_buffer = ctx.enqueue_create_buffer[dtype](blocks)

# Fill in input values sequentially and copy to device
with in_buffer.map_to_host() as host_buffer:
    iota(host_buffer.unsafe_ptr(), elements_in)

# Zero output buffer values
_ = out_buffer.enqueue_fill(0)

# Create the LayoutTensors
alias layout = Layout.row_major(blocks, threads)
alias InTensor = LayoutTensor[dtype, layout, MutableAnyOrigin]
var in_tensor = InTensor(in_buffer)

alias out_layout = Layout.row_major(blocks)
alias OutTensor = LayoutTensor[dtype, out_layout, MutableAnyOrigin]
var out_tensor = OutTensor(out_buffer)

fn reduce_sum(in_tensor: InTensor, out_tensor: OutTensor):
    var value = in_tensor.load[1](block_idx.x, thread_idx.x)
    value = warp.sum(value)
    if thread_idx.x == 0:
        out_tensor[block_idx.x] = value

ctx.enqueue_function[reduce_sum](
    in_tensor,
    out_tensor,
    grid_dim=blocks,
    block_dim=threads,
)

with out_buffer.map_to_host() as host_buffer:
    print(host_buffer)
```

```text
HostBuffer([6.0, 22.0, 38.0, 54.0, 70.0, 86.0, 102.0, 118.0])
```

The next chapter is coming soon. In the meantime you can check out some [GPU programming examples here](https://github.com/modular/modular/tree/main/examples/gpu_functions), or learn how you can integrate your GPU programming experience into the Python ecosystem [with custom ops](/max/custom-ops/).

---

## Batch

`@register_passable(trivial)`

`struct Batch`

A batch of benchmarks. The `benchmark.run()` function works out how many iterations to run in each batch based on how long the previous iterations took.

## Fields

* ​duration (`Int`): Total duration of batch stored as nanoseconds.
* ​iterations (`Int`): Total iterations in the batch.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(*, other: Self) -> Self`

Explicitly construct a deep copy of the provided value.

**Args:**

* ​other (`Self`): The value to copy.

### `mean`

`mean(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]`

Returns the average duration of the batch.

**Args:**

* ​unit (`String`): The time unit to display, for example: ns, ms, s (default `s`).

**Returns:**

The average duration of the batch.
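To see per-batch data in practice, here's a minimal sketch (the workload name `sleeper` is arbitrary, modeled on the `benchmark` module examples later in this document):

```mojo
import benchmark
from time import sleep

fn sleeper():
    sleep(.01)

def main():
    var report = benchmark.run[sleeper]()
    # print_full() includes each batch's iteration count, mean, and duration,
    # which are derived from the Batch fields described above.
    report.print_full()
```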
--- ## batched_matmul `batched_matmul[rank: Int, a_type: DType, b_type: DType, c_type: DType, //, *, transpose_a: Bool, transpose_b: Bool, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c_buf: NDBuffer[c_type, rank, origin], a_buf: NDBuffer[a_type, rank, origin], b_buf: NDBuffer[b_type, rank, origin], *, context: DeviceContextPtr = DeviceContextPtr())` `batched_matmul[rank: Int, a_type: DType, b_type: DType, c_type: DType, //, *, transpose_b: Bool, elementwise_epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), saturated_vnni: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c_buf: NDBuffer[c_type, rank, origin], a_buf: NDBuffer[a_type, rank, origin], b_buf: NDBuffer[b_type, rank, origin], *, context: DeviceContextPtr = DeviceContextPtr())` --- ## batched_matmul_kernel `batched_matmul_kernel[rank: Int, c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), accum_type: DType = get_accum_type[::DType,::DType]()](c_buff: NDBuffer[c_type, 3, MutableAnyOrigin, c_shape], a_buff: NDBuffer[a_type, 3, MutableAnyOrigin, a_shape], b_buff: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], c_buff_nd_shape: IndexList[rank])` --- ## batched_matmul_shape `batched_matmul_shape[rank: Int, a_type: DType, b_type: DType, single_thread_blocking_override: Bool](a_buff: NDBuffer[a_type, rank, origin], b_buff: NDBuffer[b_type, rank, origin]) -> IndexList[rank]` Compute the output shape of a `batch_matmul` operation, and assert the inputs are compatible. **Parameters:** * ​rank (`Int`): Rank of the input and output tensors. * ​a\_type (`DType`): Type of the lhs input tensor. * ​b\_type (`DType`): Type of the rhs input tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​a\_buff (`NDBuffer[a_type, rank, origin]`): The lhs input tensor. * ​b\_buff (`NDBuffer[b_type, rank, origin]`): The rhs input tensor. **Returns:** The output shape. --- ## Batching Batching is the process of combining multiple inference requests into a single forward pass through the model, thus executing multiple requests simultaneously and improving computational efficiency. To account for requests with varying sequence lengths, it's common to add techniques such as [padding](padding-tokens.mdx) (to standardize lengths) or [ragged tensors](ragged-tensors.mdx) (to handle variable lengths directly). Batch sizes can be either static or dynamic. Whereas static batching uses a fixed batch size and thus waits until the system receives a specific number of inference requests before sending them into the model, dynamic batching uses a flexible batch size. 
For example, dynamic batching may send a batch of requests to the model as soon as the batch either reaches a certain number of requests (batch size limit) or it reaches a timeout threshold. Dynamic batching can get a lot more complicated than that, with additional tricks that keep GPUs busy instead of waiting for one batch to finish before starting another. One such strategy for large language models (LLMs) is [continuous batching](continuous-batching.mdx).

---

## Bench

`struct Bench`

Constructs a Benchmark object, used for running multiple benchmarks and comparing the results.

Example:

```mojo
from benchmark import (
    Bench,
    BenchConfig,
    Bencher,
    BenchId,
    ThroughputMeasure,
    BenchMetric,
    Format,
)
from utils import IndexList
from gpu.host import DeviceContext
from pathlib import Path

fn example_kernel():
    print("example_kernel")

var shape = IndexList[2](1024, 1024)
var bench = Bench(BenchConfig(max_iters=100))

@parameter
@always_inline
fn example(mut b: Bencher, shape: IndexList[2]) capturing raises:
    @parameter
    @always_inline
    fn kernel_launch(ctx: DeviceContext) raises:
        ctx.enqueue_function[example_kernel](
            grid_dim=shape[0], block_dim=shape[1]
        )

    var bench_ctx = DeviceContext()
    b.iter_custom[kernel_launch](bench_ctx)

bench.bench_with_input[IndexList[2], example](
    BenchId("top_k_custom", "gpu"),
    shape,
    ThroughputMeasure(
        BenchMetric.elements, shape.flattened_length()
    ),
    ThroughputMeasure(
        BenchMetric.flops, shape.flattened_length() * 3 # number of ops
    ),
)

# Add more benchmarks like above to compare results

# Pretty print in table format
print(bench)

# Dump report to csv file
bench.config.out_file = Path("out.csv")
bench.dump_report()

# Print in tabular csv format
bench.config.format = Format.tabular
print(bench)
```

You can pass arguments when running a program that makes use of `Bench`:

```sh
mojo benchmark.mojo -o out.csv -r 10
```

This will repeat the benchmarks 10 times and write the output to `out.csv` in csv format.

## Fields

* ​config (`BenchConfig`): Benchmark configuration object to control the length and frequency of benchmarks.
* ​mode (`Mode`): Benchmark mode object representing benchmark or test mode.
* ​info\_vec (`List[BenchmarkInfo]`): A list containing the benchmark info.

## Implemented traits

`AnyType`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__(out self, config: Optional[BenchConfig] = Optional(None), mode: Mode = Mode(0))`

Constructs a Benchmark object based on specific configuration and mode.

**Args:**

* ​config (`Optional[BenchConfig]`): Benchmark configuration object to control length and frequency of benchmarks.
* ​mode (`Mode`): Benchmark mode object representing benchmark or test mode.

### `bench_with_input`

`bench_with_input[: origin.set, //, T: AnyType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, measures: List[ThroughputMeasure] = List())`

Benchmarks an input function with input args of type AnyType.

**Parameters:**

* ​T (`AnyType`): Benchmark function input type.
* ​bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked.

**Args:**

* ​bench\_id (`BenchId`): The benchmark Id object used for identification.
* ​input (`T`): Represents the target function's input arguments.
* ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's.
`bench_with_input[: origin.set, //, T: AnyType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, *measures: ThroughputMeasure)` Benchmarks an input function with input args of type AnyType. **Parameters:** * ​T (`AnyType`): Benchmark function input type. * ​bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​input (`T`): Represents the target function's input arguments. * ​\*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's. `bench_with_input[: origin.set, //, T: AnyTrivialRegType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, measures: List[ThroughputMeasure] = List())` Benchmarks an input function with input args of type AnyTrivialRegType. **Parameters:** * ​T (`AnyTrivialRegType`): Benchmark function input type. * ​bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​input (`T`): Represents the target function's input arguments. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_with_input[: origin.set, //, T: AnyTrivialRegType, bench_fn: fn(mut Bencher, T) raises capturing -> None](mut self, bench_id: BenchId, input: T, *measures: ThroughputMeasure)` Benchmarks an input function with input args of type AnyTrivialRegType. **Parameters:** * ​T (`AnyTrivialRegType`): Benchmark function input type. * ​bench\_fn (`fn(mut Bencher, T) raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​input (`T`): Represents the target function's input arguments. * ​\*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's. ### `bench_function` `bench_function[: origin.set, //, bench_fn: fn() raises capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn() raises capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn() capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn() capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. `bench_function[: origin.set, //, bench_fn: fn(mut Bencher) capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmarks or Tests an input function. **Parameters:** * ​bench\_fn (`fn(mut Bencher) capturing -> None`): The function to be benchmarked. **Args:** * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's. 
`bench_function[: origin.set, //, bench_fn: fn(mut Bencher) capturing -> None](mut self, bench_id: BenchId, *measures: ThroughputMeasure)`

Benchmarks or Tests an input function.

**Parameters:**

* ​bench\_fn (`fn(mut Bencher) capturing -> None`): The function to be benchmarked.

**Args:**

* ​bench\_id (`BenchId`): The benchmark Id object used for identification.
* ​\*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's.

`bench_function[: origin.set, //, bench_fn: fn(mut Bencher) raises capturing -> None](mut self, bench_id: BenchId, measures: List[ThroughputMeasure] = List())`

Benchmarks or Tests an input function.

**Parameters:**

* ​bench\_fn (`fn(mut Bencher) raises capturing -> None`): The function to be benchmarked.

**Args:**

* ​bench\_id (`BenchId`): The benchmark Id object used for identification.
* ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's.

`bench_function[: origin.set, //, bench_fn: fn(mut Bencher) raises capturing -> None](mut self, bench_id: BenchId, *measures: ThroughputMeasure)`

Benchmarks or Tests an input function.

**Parameters:**

* ​bench\_fn (`fn(mut Bencher) raises capturing -> None`): The function to be benchmarked.

**Args:**

* ​bench\_id (`BenchId`): The benchmark Id object used for identification.
* ​\*measures (`ThroughputMeasure`): Variadic arg used to represent a list of ThroughputMeasure's.

### `dump_report`

`dump_report(mut self)`

Prints out the report from a Benchmark execution. If `Bench.config.out_file` is set, it will also write the output in the format set in `out_file_format` to the file defined in `out_file`.

### `pad`

`pad(self, width: Int, string: String) -> String`

Pads a string to a given width.

**Args:**

* ​width (`Int`): The width to pad the string to.
* ​string (`String`): The string to pad.

**Returns:**

A string padded to the given width.

### `__str__`

`__str__(self) -> String`

Returns a string representation of the benchmark results.

**Returns:**

A string representing the benchmark results.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes the benchmark results to a writer.

**Parameters:**

* ​W (`Writer`): A type conforming to the Writer trait.

**Args:**

* ​writer (`W`): The writer to write to.

---

## BenchConfig

`struct BenchConfig`

Defines a benchmark configuration struct to control execution times and frequency.

## Fields

* ​out\_file (`Optional[Path]`): Output file to write results to.
* ​min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs.
* ​max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs.
* ​min\_warmuptime\_secs (`SIMD[float64, 1]`): Lower bound on warmup time in secs.
* ​max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement.
* ​max\_iters (`Int`): Max number of iterations to run.
* ​num\_repetitions (`Int`): Number of times the benchmark has to be repeated.
* ​flush\_denormals (`Bool`): Whether or not the denormal values are flushed.
* ​show\_progress (`Bool`): If True, print progress of each benchmark.
* ​format (`Format`): The format to print results. (default: "table").
* ​out\_file\_format (`Format`): The format to write out the file with `dump_file` (default: "csv").
* ​verbose\_timing (`Bool`): Whether to print verbose timing results.
* ​verbose\_metric\_names (`Bool`): If True print the metric name and unit, else print the unit only.
## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `VERBOSE_TIMING_LABELS`

`alias VERBOSE_TIMING_LABELS = List(__init__[__mlir_type.!kgen.string]("min (ms)"), __init__[__mlir_type.!kgen.string]("mean (ms)"), __init__[__mlir_type.!kgen.string]("max (ms)"), __init__[__mlir_type.!kgen.string]("duration (ms)"), Tuple())`

Labels to print verbose timing results.

## Methods

### `__init__`

`__init__(out self, out_file: Optional[Path] = Optional(None), min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](2), min_warmuptime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), max_batch_size: Int = 0, max_iters: Int = 1000000000, num_repetitions: Int = 1, flush_denormals: Bool = True)`

Constructs and initializes a Benchmark config object with default and input values.

**Args:**

* ​out\_file (`Optional[Path]`): Output file to write results to.
* ​min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `1`).
* ​max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `2`).
* ​min\_warmuptime\_secs (`SIMD[float64, 1]`): Lower bound on warmup time in secs (default `1`).
* ​max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement.
* ​max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`).
* ​num\_repetitions (`Int`): Number of times the benchmark has to be repeated.
* ​flush\_denormals (`Bool`): Whether or not the denormal values are flushed.

`__init__(out self, *, other: Self)`

Explicitly construct a deep copy of the provided value.

**Args:**

* ​other (`Self`): The value to copy.

---

## bencher

## Structs

* [​`Bench`](/mojo/stdlib/benchmark/bencher/Bench): Constructs a Benchmark object, used for running multiple benchmarks and comparing the results.
* [​`BenchConfig`](/mojo/stdlib/benchmark/bencher/BenchConfig): Defines a benchmark configuration struct to control execution times and frequency.
* [​`Bencher`](/mojo/stdlib/benchmark/bencher/Bencher): Defines a Bencher struct which facilitates the timing of a target function.
* [​`BenchId`](/mojo/stdlib/benchmark/bencher/BenchId): Defines a benchmark Id struct to identify and represent a particular benchmark execution.
* [​`BenchmarkInfo`](/mojo/stdlib/benchmark/bencher/BenchmarkInfo): Defines a Benchmark Info struct to record execution Statistics.
* [​`BenchMetric`](/mojo/stdlib/benchmark/bencher/BenchMetric): Defines a benchmark throughput metric.
* [​`Format`](/mojo/stdlib/benchmark/bencher/Format): Defines a format for the benchmark output when printing or writing to a file.
* [​`Mode`](/mojo/stdlib/benchmark/bencher/Mode): Defines a Benchmark Mode to distinguish between test runs and actual benchmarks.
* [​`ThroughputMeasure`](/mojo/stdlib/benchmark/bencher/ThroughputMeasure): Records a throughput metric of metric BenchMetric and value.

---

## Bencher

`@register_passable`

`struct Bencher`

Defines a Bencher struct which facilitates the timing of a target function.

## Fields

* ​num\_iters (`Int`): Number of iterations to run the target function.
* ​elapsed (`Int`): The total time elapsed when running the target function.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit` `__init__(num_iters: Int) -> Self`

Constructs a Bencher object to run and time a function.
**Args:** * ​num\_iters (`Int`): Number of times to run the target function. ### `iter` `iter[: origin.set, //, iter_fn: fn() capturing -> None](mut self)` Returns the total elapsed time by running a target function a particular number of times. **Parameters:** * ​iter\_fn (`fn() capturing -> None`): The target function to benchmark. `iter[iter_fn: fn() raises capturing -> None](mut self)` Returns the total elapsed time by running a target function a particular number of times. **Parameters:** * ​iter\_fn (`fn() raises capturing -> None`): The target function to benchmark. ### `iter_preproc` `iter_preproc[: origin.set, : origin.set, //, iter_fn: fn() capturing -> None, preproc_fn: fn() capturing -> None](mut self)` Returns the total elapsed time by running a target function a particular number of times. **Parameters:** * ​iter\_fn (`fn() capturing -> None`): The target function to benchmark. * ​preproc\_fn (`fn() capturing -> None`): The function to preprocess the target function. ### `iter_custom` `iter_custom[: origin.set, //, iter_fn: fn(Int) capturing -> Int](mut self)` Times a target function with custom number of iterations. **Parameters:** * ​iter\_fn (`fn(Int) capturing -> Int`): The target function to benchmark. `iter_custom[: origin.set, //, kernel_launch_fn: fn(DeviceContext) raises capturing -> None](mut self, ctx: DeviceContext)` Times a target GPU function with custom number of iterations via DeviceContext ctx. **Parameters:** * ​kernel\_launch\_fn (`fn(DeviceContext) raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ​ctx (`DeviceContext`): The GPU DeviceContext for launching kernel. `iter_custom[: origin.set, //, kernel_launch_fn: fn(DeviceContext, Int) raises capturing -> None](mut self, ctx: DeviceContext)` Times a target GPU function with custom number of iterations via DeviceContext ctx. **Parameters:** * ​kernel\_launch\_fn (`fn(DeviceContext, Int) raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ​ctx (`DeviceContext`): The GPU DeviceContext for launching kernel. `iter_custom[iter_fn: fn(Int) raises capturing -> Int](mut self)` Times a target function with custom number of iterations. **Parameters:** * ​iter\_fn (`fn(Int) raises capturing -> Int`): The target function to benchmark. ### `iter_custom_multicontext` `iter_custom_multicontext[: origin.set, //, kernel_launch_fn: fn() raises capturing -> None](mut self, ctxs: List[DeviceContext])` Times a target GPU function with custom number of iterations via DeviceContext ctx. **Parameters:** * ​kernel\_launch\_fn (`fn() raises capturing -> None`): The target GPU kernel launch function to benchmark. **Args:** * ​ctxs (`List[DeviceContext]`): The list of GPU DeviceContext's for launching kernel. --- ## BenchId `struct BenchId` Defines a benchmark Id struct to identify and represent a particular benchmark execution. ## Fields * ​func\_name (`String`): The target function name. * ​input\_id (`Optional[String]`): The target function input id phrase. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, func_name: String, input_id: String)` Constructs a Benchmark Id object from input function name and Id phrase. **Args:** * ​func\_name (`String`): The target function name. * ​input\_id (`String`): The target function input id phrase. `@implicit` `__init__(out self, func_name: String)` Constructs a Benchmark Id object from input function name. 
**Args:**

* ​func\_name (`String`): The target function name.

`@implicit` `__init__(out self, func_name: StringLiteral[value])`

Constructs a Benchmark Id object from input function name.

**Args:**

* ​func\_name (`StringLiteral[value]`): The target function name.

---

## benchmark

Implements the benchmark module for runtime benchmarking.

You can import these APIs from the `benchmark` package. For example:

```mojo
import benchmark
from time import sleep
```

You can pass any `fn` as a parameter into `benchmark.run[...]()`; it will return a `Report` where you can get the mean, duration, max, and more:

```mojo
fn sleeper():
    sleep(.01)

var report = benchmark.run[sleeper]()
print(report.mean())
```

```output
0.012256487394957985
```

You can print a full report:

```mojo
report.print()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012265747899159664
Total: 1.459624
Iters: 119
Warmup Total: 0.025020000000000001
Fastest Mean: 0.0121578
Slowest Mean: 0.012321428571428572
```

Or all the batch runs:

```mojo
report.print_full()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012368649122807017
Total: 1.410026
Iters: 114
Warmup Total: 0.023341000000000001
Fastest Mean: 0.012295586956521738
Slowest Mean: 0.012508099999999999

Batch: 1
Iterations: 20
Mean: 0.012508099999999999
Duration: 0.250162

Batch: 2
Iterations: 46
Mean: 0.012295586956521738
Duration: 0.56559700000000002

Batch: 3
Iterations: 48
Mean: 0.012380562499999999
Duration: 0.59426699999999999
```

If you want to use a different time unit you can bring in the Unit and pass it in as an argument:

```mojo
from benchmark import Unit

report.print(Unit.ms)
```

```output
---------------------
Benchmark Report (ms)
---------------------
Mean: 0.012312411764705882
Total: 1.465177
Iters: 119
Warmup Total: 0.025010999999999999
Fastest Mean: 0.012015649999999999
Slowest Mean: 0.012421204081632654
```

The units are just aliases for string constants, so you can, for example, do:

```mojo
print(report.mean("ms"))
```

```output
12.199145299145298
```

`benchmark.run` takes four arguments to change the behavior. To set warmup iterations to 5:

```mojo
r = benchmark.run[sleeper](5)
```

```output
0.012004808080808081
```

To set 1 warmup iteration, 2 max iterations, a min total time of 3 s, and a max total time of 4 s:

```mojo
r = benchmark.run[sleeper](1, 2, 3, 4)
```

Note that the min total time will take precedence over max iterations.

## Structs

* [​`Batch`](/mojo/stdlib/benchmark/benchmark/Batch): A batch of benchmarks. The `benchmark.run()` function works out how many iterations to run in each batch based on how long the previous iterations took.
* [​`Report`](/mojo/stdlib/benchmark/benchmark/Report): Contains the average execution time, iterations, min and max of each batch.
* [​`Unit`](/mojo/stdlib/benchmark/benchmark/Unit): Time Unit used by Benchmark Report.

## Functions

* [​`run`](/mojo/stdlib/benchmark/benchmark/run): Benchmarks the function passed in as a parameter.

---

## benchmark

Implements the benchmark package for runtime benchmarking.

You can import these APIs from the `benchmark` package.
For example:

```mojo
import benchmark
from time import sleep
```

You can pass any `fn` as a parameter into `benchmark.run[...]()`; it will return a `Report` where you can get the mean, duration, max, and more:

```mojo
fn sleeper():
    sleep(.01)

var report = benchmark.run[sleeper]()
print(report.mean())
```

```output
0.012256487394957985
```

You can print a full report:

```mojo
report.print()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012265747899159664
Total: 1.459624
Iters: 119
Warmup Mean: 0.01251
Warmup Total: 0.025020000000000001
Warmup Iters: 2
Fastest Mean: 0.0121578
Slowest Mean: 0.012321428571428572
```

Or all the batch runs:

```mojo
report.print_full()
```

```output
---------------------
Benchmark Report (s)
---------------------
Mean: 0.012368649122807017
Total: 1.410026
Iters: 114
Warmup Mean: 0.0116705
Warmup Total: 0.023341000000000001
Warmup Iters: 2
Fastest Mean: 0.012295586956521738
Slowest Mean: 0.012508099999999999

Batch: 1
Iterations: 20
Mean: 0.012508099999999999
Duration: 0.250162

Batch: 2
Iterations: 46
Mean: 0.012295586956521738
Duration: 0.56559700000000002

Batch: 3
Iterations: 48
Mean: 0.012380562499999999
Duration: 0.59426699999999999
```

If you want to use a different time unit you can bring in the Unit and pass it in as an argument:

```mojo
from benchmark import Unit

report.print(Unit.ms)
```

```output
---------------------
Benchmark Report (ms)
---------------------
Mean: 0.012312411764705882
Total: 1.465177
Iters: 119
Warmup Mean: 0.012505499999999999
Warmup Total: 0.025010999999999999
Warmup Iters: 2
Fastest Mean: 0.012015649999999999
Slowest Mean: 0.012421204081632654
```

The units are just aliases for string constants, so you can, for example, do:

```mojo
print(report.mean("ms"))
```

```output
12.199145299145298
```

`benchmark.run` takes four arguments to change the behavior. To set warmup iterations to 5:

```mojo
r = benchmark.run[sleeper](5)
```

```output
0.012004808080808081
```

To set 1 warmup iteration, 2 max iterations, a min total time of 3 s, and a max total time of 4 s:

```mojo
r = benchmark.run[sleeper](1, 2, 3, 4)
```

Note that the min total time will take precedence over max iterations.

## Modules

* [​`bencher`](/mojo/stdlib/benchmark/bencher/):
* [​`benchmark`](/mojo/stdlib/benchmark/benchmark/): Implements the benchmark module for runtime benchmarking.
* [​`compiler`](/mojo/stdlib/benchmark/compiler/):
* [​`memory`](/mojo/stdlib/benchmark/memory/):
* [​`quick_bench`](/mojo/stdlib/benchmark/quick_bench/):

---

## Benchmark MAX on an NVIDIA H100 GPU

:::success MAX supports many GPU types
This article will soon reflect all the GPU types that MAX supports.
Available today: H100, H200, A100, A10G, L40s.
Coming soon: B100s, B200s, MI300X.
:::

Performance optimization is a key challenge in deploying AI inference workloads, especially when balancing factors like accuracy, latency, and cost. In this tutorial, we'll show you how to benchmark MAX on an NVIDIA H100 GPU, using a Python script to evaluate key metrics, including the following:

- Request throughput
- Input and output token throughput
- Time-to-first-token (TTFT)
- Time per output token (TPOT)

Our script ([`benchmark_serving.py`](https://github.com/modular/modular/tree/main/benchmark/benchmark_serving.py)) is adapted from vLLM with additional features, such as client-side GPU metric collection, to ensure consistent and comprehensive performance measurement that's tailored to MAX.
Before we start the benchmark script, we'll start an endpoint running Llama 3 with MAX. Then we'll use the `benchmark_serving.py` script to send a batch of inference requests and measure the performance.

## Requirements

To get started with this tutorial, you need the following:

- **Hardware**: Local access to NVIDIA H100 GPUs
- **Python**: Version 3.9 - 3.13
- **Magic**: Follow the [Magic installation guide](/magic/#install-magic)
- **Docker and Docker Compose**: Installed with [NVIDIA GPU support](https://docs.docker.com/config/containers/resource_constraints/#gpu)
- **Latest NVIDIA drivers**: Refer to the [NVIDIA driver installation guide](https://www.nvidia.com/download/index.aspx)
- **NVIDIA Container Toolkit**: Follow the [installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- **Hugging Face account**: Obtain an [access token](https://huggingface.co/settings/tokens) and set it as an environment variable:

  ```bash
  export HF_TOKEN="your_huggingface_token"
  ```

## Set up your environment

From here on, you should be running commands on the system with the NVIDIA GPU. If you haven't already, open a shell to that system now.

Clone the MAX repository, navigate to the `benchmark` folder, and install the dependencies in a virtual environment with the following commands:

```bash
git clone -b stable https://github.com/modular/modular.git
cd modular/benchmark
magic shell
```

:::note
To exit the `magic` shell, simply run `exit`. For more information, see [the Magic tutorial](/max/tutorials/magic).
:::

## Prepare benchmarking dataset (optional)

This tutorial uses the `--dataset-name` argument in our benchmark script to automatically download the `sharegpt` or `code-debug` datasets for benchmarking. You can optionally provide a path to your own dataset using the `--dataset-path` argument.

For example, you can download the [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) dataset with the following command:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

You can then reference the local dataset using the `--dataset-path` argument:

```bash
python benchmark_serving.py \
    ...
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
```

For more information on available benchmarking datasets, see [Command line arguments for `benchmark_serving.py`](https://github.com/modular/modular/tree/main/benchmark#command-line-arguments-for-benchmark_servingpy).

## Start the model endpoint

We provide a pre-configured GPU-enabled Docker container that simplifies the process to deploy an endpoint with MAX. For more information, see [MAX container](/max/container).

To pull and run the MAX container that hosts Llama 3 as an endpoint, run this command:

```bash
docker run --rm --gpus=all \
  --ipc=host \
  -p 8000:8000 \
  --env "HF_TOKEN=${HF_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  modular/max-nvidia-full:latest \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --devices gpu:0,1,2,3 \
  --max-num-steps 10 \
  --max-batch-size 512
```

where `--devices gpu:0,1,2,3` refers to the GPU IDs to use. Note that Llama 3.3 70B requires 4xH100 or 4xA100 instances to run in bfloat16 precision. You can explore other model options in the MAX [model repository](https://builds.modular.com/?category=models).

:::note
These settings work well on H100 GPUs.
You can adjust `--max-batch-size` depending on your system's available resources, such as GPU memory.
:::

You'll know that the server is running when you see the following log:

```output
Server ready on http://0.0.0.0:8000
```

## Start benchmarking

To benchmark MAX with 8 prompts from the `code_debug` dataset, run this command:

```bash
python benchmark_serving.py \
  --backend modular \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name code_debug \
  --endpoint /v1/completions \
  --num-prompts 8 \
  --collect-gpu-stats
```

For more information on available arguments, see the [MAX benchmarking reference](https://github.com/modular/modular/tree/main/benchmark#reference).

:::tip Optional cleanup
Here's how to clean up the Docker image:

```bash
docker rmi $(docker images -q modular/max-nvidia-full:latest)
```
:::

## Interpret the results

The output should look similar to the following:

```output
============ Serving Benchmark Result ============
Successful requests:                     8
Failed requests:                         0
Benchmark duration (s):                  90.00
Total input tokens:                      712840
Total generated tokens:                  16
Request throughput (req/s):              0.09
Input token throughput (tok/s):          7920.01
Output token throughput (tok/s):         0.18
---------------Time to First Token----------------
Mean TTFT (ms):                          46506.48
Median TTFT (ms):                        44050.82
P99 TTFT (ms):                           88887.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17790.64
Median TPOT (ms):                        17292.79
P99 TPOT (ms):                           38986.51
---------------Inter-token Latency----------------
Mean ITL (ms):                           17790.57
Median ITL (ms):                         17292.70
P99 ITL (ms):                            38986.49
-------------------Token Stats--------------------
Max input tokens:                        109256
Max output tokens:                       2
Max total tokens:                        109258
--------------------GPU Stats---------------------
GPU Utilization (%):                     99.24
Peak GPU Memory Used (MiB):              76312.88
GPU Memory Available (MiB):              5030.75
==================================================
```

For more information about each metric, see the [MAX benchmarking key metrics](https://github.com/modular/modular/tree/main/benchmark#key-metrics-explained).

### Measure latency with finite request rates

Latency metrics like time-to-first-token (TTFT) and time per output token (TPOT) matter most when the server isn't overloaded. An overloaded server will queue requests, which results in a massive increase in latency that varies with the size of the benchmark more than with the actual latency of the server; larger benchmarks result in a deeper queue. If you'd like to vary the size of the queue, you can adjust the request rate with the `--request-rate` flag (see the sketch at the end of this section). This creates a stochastic request load with an average rate of `N` requests per second, where `N` is the value you pass.

### Comparing to alternatives

You can run the benchmarking script using the Modular, vLLM, or TensorRT-LLM backends to compare performance with alternative LLM serving frameworks. When using the TensorRT-LLM backend, be sure to change the `--endpoint` to `/v2/models/ensemble/generate_stream`. MAX achieves competitive throughput on most workloads and will further improve with upcoming optimizations.
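As a concrete illustration of the `--request-rate` flag discussed above, here's a sketch that combines it with the benchmarking flags shown earlier (the prompt count and rate are arbitrary; tune them for your setup):

```bash
python benchmark_serving.py \
  --backend modular \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name sharegpt \
  --endpoint /v1/completions \
  --num-prompts 100 \
  --request-rate 10
```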
## Next steps

Now that you have detailed benchmarking results for Llama 3 on MAX using an NVIDIA H100 GPU, here are some other topics to explore next:

- [Deploy Llama 3 on GPU with MAX](/max/tutorials/max-serve-local-to-cloud): Learn how to deploy Llama 3 on GPU with MAX.
- [Deploy Llama 3 on GPU-powered Kubernetes clusters](/max/tutorials/deploy-max-serve-on-kubernetes): Learn how to deploy Llama 3 using Kubernetes, MAX, and NVIDIA GPUs.
- [Bring your own fine-tuned model to MAX pipelines](/max/tutorials/max-pipeline-bring-your-own-model): Learn how to customize your own model in MAX pipelines.
- [Get started with MAX Graph in Python](/max/tutorials/get-started-with-max-graph-in-python): Learn how to build a model graph with our Python API for inference with MAX Engine.

To read more about our performance methodology, check out our blog post, [MAX GPU: State of the Art Throughput on a New GenAI platform](https://www.modular.com/blog/max-gpu-state-of-the-art-throughput-on-a-new-genai-platform).

You can also share your experience on the [Modular Forum](https://forum.modular.com/) and in our [Discord Community](https://discord.gg/modular). Be sure to stay up to date with all the performance improvements coming soon by [signing up for our newsletter](https://www.modular.com/modverse#signup).

---

## BenchmarkInfo

`struct BenchmarkInfo`

Defines a Benchmark Info struct to record execution Statistics.

## Fields

* ​name (`String`): The name of the benchmark.
* ​result (`Report`): The output report after executing a benchmark.
* ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's.
* ​verbose\_timing (`Bool`): Whether to print verbose timing results.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, name: String, result: Report, measures: List[ThroughputMeasure] = List(), verbose_timing: Bool = False)`

Constructs a `BenchmarkInfo` object to return benchmark report and statistics.

**Args:**

* ​name (`String`): The name of the benchmark.
* ​result (`Report`): The output report after executing a benchmark.
* ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of ThroughputMeasure's.
* ​verbose\_timing (`Bool`): Whether to print verbose timing results.

`__init__(out self, *, other: Self)`

Explicitly construct a deep copy of the provided value.

**Args:**

* ​other (`Self`): The value to copy.

---

## BenchMetric

`struct BenchMetric`

Defines a benchmark throughput metric.

## Fields

* ​code (`Int`): Op-code of the Metric.
* ​name (`String`): Metric's name.
* ​unit (`String`): Metric's throughput rate unit (count/second).

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `bytes`

`alias bytes = BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s"))`

### `DEFAULTS`

`alias DEFAULTS = List(BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s")), BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s")), BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s")), Tuple())`

Default set of benchmark metrics.
### `elements`

`alias elements = BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s"))`

### `flops`

`alias flops = BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s"))`

### `theoretical_flops`

`alias theoretical_flops = BenchMetric(3, __init__[__mlir_type.!kgen.string]("TheoreticalArithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s"))`

## Methods

### `__init__`

`__init__(out self, *, other: Self)`

Explicitly construct a deep copy of the provided value.

**Args:**

* ​other (`Self`): The value to copy.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Compares two metrics for equality.

**Args:**

* ​other (`Self`): The metric to compare.

**Returns:**

True if the two metrics are equal.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Compares two metrics for inequality.

**Args:**

* ​other (`Self`): The metric to compare.

**Returns:**

True if the two metrics are NOT equal.

### `__str__`

`__str__(self) -> String`

Gets a string representation of this metric.

**Returns:**

The string representation.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this BenchMetric to the provided Writer.

**Parameters:**

* ​W (`Writer`): A type conforming to the Writable trait.

**Args:**

* ​writer (`W`): The object to write to.

### `check_name`

`check_name(self, alt_name: String) -> Bool`

Checks whether a string contains the metric's name.

**Args:**

* ​alt\_name (`String`): Alternative name of a metric.

**Returns:**

True if `alt_name` is a valid alternative to the metric's name.

### `get_metric_from_list`

`static get_metric_from_list(name: String, metric_list: List[BenchMetric]) -> Self`

Gets a metric from a given list using only the metric's name.

**Args:**

* ​name (`String`): Metric's name.
* ​metric\_list (`List[BenchMetric]`): List of metrics to search.

**Returns:**

The selected metric.

---

## bin

`bin(num: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String`

Returns the binary string representation of an integral value.

```mojo
print(bin(123))
print(bin(-123))
```

```plaintext
'0b1111011'
'-0b1111011'
```

**Args:**

* ​num (`SIMD[dtype, 1]`): An integral scalar value.
* ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int.

**Returns:**

The binary string representation of num.

`bin(b: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String`

Returns the binary representation of a scalar bool.

**Args:**

* ​b (`SIMD[bool, 1]`): A scalar bool value.
* ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int.

**Returns:**

The binary string representation of b.

`bin[T: Intable, //](num: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0b")) -> String`

Returns the binary representation of an indexer type.

**Parameters:**

* ​T (`Intable`): The Intable type.

**Args:**

* ​num (`T`): An indexer value.
* ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int.

**Returns:**

The binary string representation of num.
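A quick sketch exercising two of the overloads above (the `Int` literal resolves to the `Intable` overload; `UInt8` is a scalar SIMD value, so it resolves to the integral-scalar overload):

```mojo
def main():
    print(bin(123))       # 0b1111011 (Intable overload)
    print(bin(UInt8(6)))  # 0b110 (integral scalar overload)
```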
---

## bindings

## Aliases

### `MOJO_PYTHON_TYPE_OBJECTS`

`alias MOJO_PYTHON_TYPE_OBJECTS = _Global[__init__[__mlir_type.!kgen.string]("MOJO_PYTHON_TYPE_OBJECTS"), Dict[StringSlice[StaticConstantOrigin], TypedPythonObject[__init__[__mlir_type.!kgen.string]("Type")]], _init_python_type_objects]`

Mapping of Mojo type identifiers to the unique `PyTypeObject*` that binds that Mojo type to this CPython interpreter instance.

### `Typed_initproc`

`alias Typed_initproc = fn(PyObjectPtr, TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")], PyObjectPtr) -> SIMD[int32, 1]`

### `Typed_newfunc`

`alias Typed_newfunc = fn(UnsafePointer[PyTypeObject], TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")], PyObjectPtr) -> PyObjectPtr`

## Structs

* [​`PyMojoObject`](/mojo/stdlib/python/bindings/PyMojoObject): Storage backing a PyObject\* wrapping a Mojo value.
* [​`PythonModuleBuilder`](/mojo/stdlib/python/bindings/PythonModuleBuilder): A builder for creating Python modules with Mojo function and type bindings.
* [​`PythonTypeBuilder`](/mojo/stdlib/python/bindings/PythonTypeBuilder): A builder for a Python 'type' binding.

## Functions

* [​`check_arguments_arity`](/mojo/stdlib/python/bindings/check_arguments_arity): Validate that the provided arguments match the expected function arity.
* [​`lookup_py_type_object`](/mojo/stdlib/python/bindings/lookup_py_type_object): Retrieve a reference to the unique Python type describing Python objects containing Mojo values of type `T`.

---

## bit

Provides functions for bit manipulation.

You can import these APIs from the `bit` package. For example:

```mojo
from bit import count_leading_zeros
```

## Functions

* [​`bit_not`](/mojo/stdlib/bit/bit/bit_not): Performs a bitwise NOT operation on an SIMD vector of integer values.
* [​`bit_reverse`](/mojo/stdlib/bit/bit/bit_reverse): Reverses the bitpattern of an integer value.
* [​`bit_width`](/mojo/stdlib/bit/bit/bit_width): Computes the minimum number of bits required to represent the integer.
* [​`byte_swap`](/mojo/stdlib/bit/bit/byte_swap): Byte-swaps an integer value with an even number of bytes.
* [​`count_leading_zeros`](/mojo/stdlib/bit/bit/count_leading_zeros): Counts the number of leading zeros of an integer.
* [​`count_trailing_zeros`](/mojo/stdlib/bit/bit/count_trailing_zeros): Counts the number of trailing zeros for an integer.
* [​`log2_floor`](/mojo/stdlib/bit/bit/log2_floor): Returns the floor of the base-2 logarithm of an integer value.
* [​`next_power_of_two`](/mojo/stdlib/bit/bit/next_power_of_two): Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1.
* [​`pop_count`](/mojo/stdlib/bit/bit/pop_count): Counts the number of bits set in an integer value.
* [​`prev_power_of_two`](/mojo/stdlib/bit/bit/prev_power_of_two): Computes the largest power of 2 that is less than or equal to the input value. Any integral value less than or equal to 0 will be floored to 0.
* [​`rotate_bits_left`](/mojo/stdlib/bit/bit/rotate_bits_left): Shifts the bits of an input to the left by `shift` bits (with wrap-around).
* [​`rotate_bits_right`](/mojo/stdlib/bit/bit/rotate_bits_right): Shifts the bits of an input to the right by `shift` bits (with wrap-around).

---

## bit

Implements the bit package.

## Modules

* [​`bit`](/mojo/stdlib/bit/bit/): Provides functions for bit manipulation.
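As a quick illustration of a few of these functions (a minimal sketch using the scalar `Int` overloads; the printed values assume a 64-bit `Int`):

```mojo
from bit import bit_width, count_leading_zeros, pop_count

fn main():
    print(count_leading_zeros(1))  # 63 on a 64-bit target
    print(pop_count(0b1011))       # 3 bits set
    print(bit_width(255))          # 8 bits needed
```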
--- ## bit_not `bit_not[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs a bitwise NOT operation on an SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is computed as a bitwise NOT of the integer value at position `i` of the input value. --- ## bit_reverse `bit_reverse(val: Int) -> Int` Reverses the bitpattern of an integer value. **Args:** * ​val (`Int`): The input value. **Returns:** The input value with its bitpattern reversed. `bit_reverse[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Element-wise reverses the bitpattern of a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` has a reversed bitpattern of an integer value of the element at position `i` of the input value. --- ## bit_width `bit_width(val: Int) -> Int` Computes the minimum number of bits required to represent the integer. **Args:** * ​val (`Int`): The input value. **Returns:** The number of bits required to represent the integer. `bit_width[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the minimum number of bits required to represent each element of a SIMD vector of integer values. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` equals the number of bits required to represent the integer at position `i` of the input. --- ## bitcast `bitcast[dtype: DType, width: Int, //, new_type: DType, new_width: Int = width](val: SIMD[dtype, width]) -> SIMD[new_type, new_width]` Bitcasts a SIMD value to another SIMD value. For a discussion of byte order, see [Converting data: bitcasting and byte order](/mojo/manual/pointers/unsafe-pointers#converting-data-bitcasting-and-byte-order) in the Mojo Manual. Examples: The following example uses `bitcast` to break a 32-bit integer into a vector of four 8-bit integers: ```mojo from memory import bitcast one = SIMD[DType.uint32, 1](4631) many = bitcast[DType.uint8, 4](one) print(one, many) # 4631 [23, 18, 0, 0] ``` **Constraints:** The bitwidth of the two types must be the same. **Parameters:** * ​dtype (`DType`): The source type. * ​width (`Int`): The source width. * ​new\_type (`DType`): The target type. * ​new\_width (`Int`): The target width. **Args:** * ​val (`SIMD[dtype, width]`): The source value. **Returns:** A new SIMD value with the specified type and width with a bitcopy of the source SIMD value. --- ## bitset Provides a compact, grow-only set of non-negative integers. Optimized for space (1 bit per element) and speed (O(1) operations). Offers set/clear/test/toggle and fast population count. The underlying storage grows automatically but does not shrink unless `shrink_to_fit` is called (not implemented yet). 
Example:

```mojo
var bs = BitSet[128]()  # 128-bit set, all clear
bs.set(42)              # Mark value 42 as present.
if bs.test(42):         # Check membership.
    print("hit")        # Prints "hit".
bs.clear(42)            # Remove 42.
print(bs.count())       # Prints 0.
```

## Structs

* [​`BitSet`](/mojo/stdlib/collections/bitset/BitSet): A grow-only set storing non-negative integers efficiently using bits.

---

## BitSet

`struct BitSet[size: UInt]`

A grow-only set storing non-negative integers efficiently using bits.

Each integer element is represented by a single bit within an array of 64-bit words (`UInt64`). This structure is optimized for:

* **Compactness:** Uses 64 times less memory than `List[Bool]`.
* **Speed:** Offers O(1) time complexity for `set`, `clear`, `test`, and `toggle` operations (single word load/store).

Ideal for applications like data-flow analysis, graph algorithms, or any task requiring dense sets of small integers where memory and lookup speed are critical.

## Parameters

* ​size (`UInt`): The maximum number of bits the bitset can store.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__(out self)`

Initializes an empty BitSet with zero capacity and size.

`__init__(out self: BitSet[UInt(size)], init: SIMD[bool, size])`

Initializes a BitSet with the given SIMD vector of booleans.

**Args:**

* ​init (`SIMD[bool, size]`): A SIMD vector of booleans to initialize the bitset with.

### `__bool__`

`__bool__(self) -> Bool`

Checks if the bitset is non-empty (contains at least one set bit). Equivalent to `len(self) != 0` or `not self.is_empty()`.

**Returns:**

True if at least one bit is set, False otherwise.

### `__len__`

`__len__(self) -> Int`

Counts the total number of bits that are set to 1 in the bitset. Uses the efficient `pop_count` intrinsic for each underlying word. The complexity is proportional to the number of words used by the bitset's capacity (`_words_size`), not the logical size (`len`).

**Returns:**

The total count of set bits (population count).

### `is_empty`

`is_empty(self) -> Bool`

Checks if the bitset contains any set bits. Equivalent to `len(self) == 0`. Note that this checks the logical size, not the allocated capacity.

**Returns:**

True if no bits are set (logical size is 0), False otherwise.

### `set`

`set(mut self, idx: UInt)`

Sets the bit at the specified index `idx` to 1. If `idx` is greater than or equal to the current logical size, the logical size is updated. Aborts if `idx` is negative or greater than or equal to the compile-time `size`.

**Args:**

* ​idx (`UInt`): The non-negative index of the bit to set (must be less than `size`).

### `clear`

`clear(mut self, idx: UInt)`

Clears the bit at the specified index `idx` (sets it to 0). Aborts if `idx` is negative or greater than or equal to the compile-time `size`. Does not change the logical size.

**Args:**

* ​idx (`UInt`): The non-negative index of the bit to clear (must be less than `size`).

### `toggle`

`toggle(mut self, idx: UInt)`

Toggles (inverts) the bit at the specified index `idx`. If the bit becomes 1 and `idx` is greater than or equal to the current logical size, the logical size is updated. Aborts if `idx` is negative or greater than or equal to the compile-time `size`.

**Args:**

* ​idx (`UInt`): The non-negative index of the bit to toggle (must be less than `size`).

### `test`

`test(self, idx: UInt) -> Bool`

Tests if the bit at the specified index `idx` is set (is 1). Aborts if `idx` is negative or greater than or equal to the compile-time `size`.
**Args:**

* ​idx (`UInt`): The non-negative index of the bit to test (must be less than `size`).

### `clear_all`

`clear_all(mut self)`

Clears all bits in the set, resetting the logical size to 0. The allocated storage capacity remains unchanged. Equivalent to re-initializing the set with `Self()`.

### `union`

`union(self, other: Self) -> Self`

Returns a new bitset that is the union of `self` and `other`.

**Args:**

* ​other (`Self`): The bitset to union with.

**Returns:**

A new bitset containing all elements from both sets.

### `intersection`

`intersection(self, other: Self) -> Self`

Returns a new bitset that is the intersection of `self` and `other`.

**Args:**

* ​other (`Self`): The bitset to intersect with.

**Returns:**

A new bitset containing only the elements present in both sets.

### `difference`

`difference(self, other: Self) -> Self`

Returns a new bitset that is the difference of `self` and `other`.

**Args:**

* ​other (`Self`): The bitset to subtract from `self`.

**Returns:**

A new bitset containing elements from `self` that are not in `other`.

### `write_to`

`write_to[W: Writer, //](self, mut writer: W)`

Writes a string representation of the set bits to the given writer. Outputs the indices of the set bits in ascending order, enclosed in curly braces and separated by commas (e.g., "{1, 5, 42}"). Uses efficient bitwise operations to find set bits without iterating through every possible bit.

**Parameters:**

* ​W (`Writer`): The type of the writer, conforming to the `Writer` trait.

**Args:**

* ​writer (`W`): The writer instance to output the representation to.

### `__repr__`

`__repr__(self) -> String`

Returns a developer-friendly string representation of the bitset. Currently equivalent to `__str__`.

**Returns:**

A string showing the set bits (e.g., "{1, 5, 42}").

### `__str__`

`__str__(self) -> String`

Returns a user-friendly string representation of the bitset. Formats the set bits as a comma-separated list within curly braces, like "{1, 5, 42}". Uses the `write_to` method internally.

**Returns:**

A string showing the set bits.

---

## bitwidthof

`bitwidthof[type: AnyTrivialRegType, target: target = _current_target()]() -> Int`

Returns the size (in bits) of the type.

**Parameters:**

* ​type (`AnyTrivialRegType`): The type in question.
* ​target (`target`): The target architecture.

**Returns:**

The size of the type in bits.

`bitwidthof[dtype: DType, target: target = _current_target()]() -> Int`

Returns the size (in bits) of the dtype.

**Parameters:**

* ​dtype (`DType`): The type in question.
* ​target (`target`): The target architecture.

**Returns:**

The size of the dtype in bits.

---

## block

GPU block-level operations and utilities.

This module provides block-level operations for NVIDIA and AMD GPUs, including:

* Block-wide reductions:
  * sum: Compute sum across block
  * max: Find maximum value across block
  * min: Find minimum value across block
* broadcast: Broadcast value to all threads

The module builds on warp-level operations from the warp module, extending them to work across a full thread block (potentially multiple warps). It handles both NVIDIA and AMD GPU architectures and supports various data types with SIMD vectorization.

## Functions

* [​`broadcast`](/mojo/stdlib/gpu/block/broadcast): Broadcasts a value from a source thread to all threads in a block.
* [​`max`](/mojo/stdlib/gpu/block/max): Computes the maximum value across all threads in a block.
* [​`min`](/mojo/stdlib/gpu/block/min): Computes the minimum value across all threads in a block.
* [​`prefix_sum`](/mojo/stdlib/gpu/block/prefix_sum): Performs a prefix sum (scan) operation across all threads in a block. * [​`sum`](/mojo/stdlib/gpu/block/sum): Computes the sum of values across all threads in a block. --- ## Block index In GPU programming, a block index uniquely identifies a subset of [threads](thread) that execute a [kernel](kernel.mdx) function on the GPU. Threads are grouped into units called [blocks](thread-block.mdx), and multiple blocks together form a larger structure known as a [grid](grid.mdx). Each block within the grid is assigned a unique block index, which can be represented across one, two, or three dimensions. This allows for flexible organization of threads to match the structure of the problem being solved. Within each block, individual threads have their own [thread index](thread-index.mdx), which, together with the block index, determines which part of the problem each thread should work on. This hierarchical structure of grids, blocks, and threads enables efficient workload distribution across the many processing cores of the GPU, maximizing parallel performance. Because a programmer can arrange thread blocks within a grid across one, two, or three dimensions, a block index is a 3-element vector of x, y, and z coordinates. For 2-dimensional arrangements, the z coordinate of all block indices is 0, and for 1-dimensional arrangements, both the y and z coordinates of all block indices are 0. --- ## block_Q4_K `struct block_Q4_K` ## Fields * ​base\_scale (`SIMD[float16, 1]`): * ​base\_min (`SIMD[float16, 1]`): * ​q\_scales\_and\_mins (`InlineArray[SIMD[uint8, 1], 12]`): * ​q\_bits (`InlineArray[SIMD[uint8, 1], 128]`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `group_count` `alias group_count = 8` ### `group_size` `alias group_size = 32` --- ## block_Q6_K `struct block_Q6_K` ## Fields * ​q\_bits\_lo (`InlineArray[SIMD[uint8, 1], 128]`): * ​q\_bits\_hi (`InlineArray[SIMD[uint8, 1], 64]`): * ​q\_scales (`InlineArray[SIMD[int8, 1], 16]`): * ​base\_scale (`SIMD[float16, 1]`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `group_count` `alias group_count = 16` ### `group_size` `alias group_size = 16` --- ## block_QK_K `struct block_QK_K` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `quantized_k` `alias quantized_k = 256` --- ## block_rank_in_cluster `block_rank_in_cluster() -> SIMD[uint32, 1]` Returns the unique identifier (rank) for the current thread block within its cluster. Note: * Only supported on NVIDIA SM90+ GPUs. * Maps directly to the `%cluster_ctarank` special register in CUDA PTX. **Returns:** A unique identifier in the range \[0, cluster\_size-1] where `cluster_size` is the total number of thread blocks in the cluster. --- ## block_reduce `block_reduce[type: DType, //, warps_per_block: Int](val: SIMD[type, 1]) -> SIMD[type, 1]` --- ## block_reduce `block_reduce[type: DType, max_warps_per_block: Int](val: SIMD[type, 1]) -> SIMD[type, 1]` --- ## block_swizzle `block_swizzle(block_idx: IndexList[2, element_type=element_type], grid_dim: IndexList[2, element_type=element_type]) -> IndexList[2, element_type=element_type]` --- ## blocked_product `blocked_product(layout_a: Layout, layout_b: Layout) -> Layout` Creates a blocked layout by combining two layouts. This function creates a hierarchical blocked layout by combining a base layout with a block layout. The result is a layout where each element of the base layout is replaced by a block defined by the second layout. 
This is particularly useful for creating tiled layouts for efficient cache utilization in tensor operations like matrix multiplication.

Example:

```mojo
from layout import Layout
from layout.layout import blocked_product

# Create a 2x3 matrix layout
var matrix = Layout.row_major(2, 3)

# Define 2x2 blocks
var block = Layout.row_major(2, 2)

# Create a blocked layout with 2x2 blocks
var blocked = blocked_product(block, matrix)
```

Output:

```plaintext
(((2, 2), (2, 3)):((2, 12), (1, 4)))
       0    1    2    3    4    5
    +----+----+----+----+----+----+
 0  |  0 |  1 |  4 |  5 |  8 |  9 |
    +----+----+----+----+----+----+
 1  |  2 |  3 |  6 |  7 | 10 | 11 |
    +----+----+----+----+----+----+
 2  | 12 | 13 | 16 | 17 | 20 | 21 |
    +----+----+----+----+----+----+
 3  | 14 | 15 | 18 | 19 | 22 | 23 |
    +----+----+----+----+----+----+
```

**Args:**

* ​layout\_a (`Layout`): The base layout to be blocked.
* ​layout\_b (`Layout`): The block layout defining the structure within each block.

**Returns:**

A new layout representing the blocked structure.

---

## BlockingScopedLock

`struct BlockingScopedLock`

A scope adapter for BlockingSpinLock.

## Fields

* ​lock (`UnsafePointer[BlockingSpinLock]`): The underlying lock instance.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Aliases

### `LockType`

`alias LockType = BlockingSpinLock`

The type of the lock.

## Methods

### `__init__`

`__init__(out self, lock: UnsafePointer[BlockingSpinLock])`

Primary constructor.

**Args:**

* ​lock (`UnsafePointer[BlockingSpinLock]`): A pointer to the underlying lock.

`__init__(out self, mut lock: BlockingSpinLock)`

Secondary constructor.

**Args:**

* ​lock (`BlockingSpinLock`): A mutable reference to the underlying lock.

### `__enter__`

`__enter__(mut self)`

Acquire the lock on entry. This is done by setting the owner of the lock to its own address.

### `__exit__`

`__exit__(mut self)`

Release the lock on exit. Reset the address on the underlying lock.

---

## BlockingSpinLock

`struct BlockingSpinLock`

A basic locking implementation that uses an integer to represent the owner of the lock.

## Fields

* ​counter (`Atomic[int64]`): The atomic counter implementing the spin lock.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Aliases

### `UNLOCKED`

`alias UNLOCKED = -1`

Any value other than -1 means locked; -1 means unlocked.

## Methods

### `__init__`

`__init__(out self)`

Default constructor.

### `lock`

`lock(mut self, owner: Int)`

Acquires the lock.

**Args:**

* ​owner (`Int`): The lock's owner (usually an address).

### `unlock`

`unlock(mut self, owner: Int) -> Bool`

Releases the lock.

**Args:**

* ​owner (`Int`): The lock's owner (usually an address).

**Returns:**

The successful release of the lock.

---

## bmm

## Aliases

### `elementwise_epilogue_type`

`alias elementwise_epilogue_type = fn[DType, Int, Int, Int](IndexList[$2], SIMD[$0, $1]) capturing -> None`

## Functions

* [​`batched_matmul`](./batched_matmul):
* [​`batched_matmul_kernel`](./batched_matmul_kernel):
* [​`batched_matmul_shape`](./batched_matmul_shape): Compute the output shape of a `batch_matmul` operation, and assert the inputs are compatible.

---

## bool

Implements the Bool class.

These are Mojo built-ins, so you don't need to import them.

## Structs

* [​`Bool`](/mojo/stdlib/builtin/bool/Bool): The primitive Bool scalar value used in Mojo.

## Traits

* [​`Boolable`](/mojo/stdlib/builtin/bool/Boolable): The `Boolable` trait describes a type that can be explicitly converted to a `Bool` or evaluated as a boolean expression in `if` or `while` conditions.
* [​`ImplicitlyBoolable`](/mojo/stdlib/builtin/bool/ImplicitlyBoolable): The `ImplicitlyBoolable` trait describes a type that can be implicitly converted to a `Bool`.

## Functions

* [​`all`](/mojo/stdlib/builtin/bool/all): Checks if **all** elements in the list are truthy.
* [​`any`](/mojo/stdlib/builtin/bool/any): Checks if **any** element in the list is truthy.

---

## Bool

`@register_passable(trivial)`

`struct Bool`

The primitive Bool scalar value used in Mojo.

## Fields

* ​value (`i1`): The underlying storage of the boolean value.

## Implemented traits

`AnyType`, `Boolable`, `Comparable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `Floatable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `ImplicitlyBoolable`, `ImplicitlyIntable`, `Indexer`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `PythonConvertible`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher`

## Aliases

### `MAX`

`alias MAX = __init__[::Boolable](True)`

The maximum value of a Bool.

### `MIN`

`alias MIN = __init__[::Boolable](False)`

The minimum value of a Bool.

## Methods

### `__init__`

`__init__() -> Self`

Construct a default, `False` Bool.

`@implicit`
`__init__[T: ImplicitlyBoolable, //](value: T) -> Self`

Convert an ImplicitlyBoolable value to a Bool.

**Parameters:**

* ​T (`ImplicitlyBoolable`): The ImplicitlyBoolable type.

**Args:**

* ​value (`T`): The boolable value.

`__init__[T: Boolable, //](value: T) -> Self`

Set the bool representation of the object.

**Parameters:**

* ​T (`Boolable`): The type of the object.

**Args:**

* ​value (`T`): The object to get the bool representation of.

`__init__(value: None) -> Self`

Set the bool representation of the `None` type to `False`.

**Args:**

* ​value (`None`): The object to get the bool representation of.

`@implicit`
`__init__(value: SIMD[bool, 1]) -> Self`

Convert a scalar SIMD value to a Bool.

**Args:**

* ​value (`SIMD[bool, 1]`): The scalar value.

### `__bool__`

`__bool__(self) -> Self`

Convert to Bool.

**Returns:**

This value.

### `__neg__`

`__neg__(self) -> Int`

Defines the unary `-` operation.

**Returns:**

0 for False and -1 for True.

### `__invert__`

`__invert__(self) -> Self`

Inverts the Bool value.

**Returns:**

True if the object is false and False otherwise.

### `__lt__`

`__lt__(self, rhs: Self) -> Self`

Compare this Bool to RHS using less-than comparison.

**Args:**

* ​rhs (`Self`): The rhs of the operation.

**Returns:**

True if self is False and rhs is True.

### `__le__`

`__le__(self, rhs: Self) -> Self`

Compare this Bool to RHS using less-than-or-equal comparison.

**Args:**

* ​rhs (`Self`): The rhs of the operation.

**Returns:**

True if self is False or rhs is True.

### `__eq__`

`__eq__(self, rhs: Self) -> Self`

Compare this Bool to RHS.

Performs an equality comparison between the Bool value and the argument. This method gets invoked when a user uses the `==` infix operator.

**Args:**

* ​rhs (`Self`): The rhs value of the equality statement.

**Returns:**

True if the two values match and False otherwise.

### `__ne__`

`__ne__(self, rhs: Self) -> Self`

Compare this Bool to RHS.

Performs a non-equality comparison between the Bool value and the argument. This method gets invoked when a user uses the `!=` infix operator.

**Args:**

* ​rhs (`Self`): The rhs value of the non-equality statement.

**Returns:**

False if the two values match and True otherwise.
### `__gt__`

`__gt__(self, rhs: Self) -> Self`

Compare this Bool to RHS using greater-than comparison.

**Args:**

* ​rhs (`Self`): The rhs of the operation.

**Returns:**

True if self is True and rhs is False.

### `__ge__`

`__ge__(self, rhs: Self) -> Self`

Compare this Bool to RHS using greater-than-or-equal comparison.

**Args:**

* ​rhs (`Self`): The rhs of the operation.

**Returns:**

True if self is True or rhs is False.

### `__and__`

`__and__(self, rhs: Self) -> Self`

Returns `self & rhs`.

Bitwise and's the Bool value with the argument. This method gets invoked when a user uses the `&` infix operator.

**Args:**

* ​rhs (`Self`): The right hand side of the `and` statement.

**Returns:**

`self & rhs`.

### `__or__`

`__or__(self, rhs: Self) -> Self`

Returns `self | rhs`.

Bitwise or's the Bool value with the argument. This method gets invoked when a user uses the `|` infix operator.

**Args:**

* ​rhs (`Self`): The right hand side of the `or` statement.

**Returns:**

`self | rhs`.

### `__xor__`

`__xor__(self, rhs: Self) -> Self`

Returns `self ^ rhs`.

Bitwise Xor's the Bool value with the argument. This method gets invoked when a user uses the `^` infix operator.

**Args:**

* ​rhs (`Self`): The right hand side of the `xor` statement.

**Returns:**

`self ^ rhs`.

### `__rand__`

`__rand__(self, lhs: Self) -> Self`

Returns `lhs & self`.

**Args:**

* ​lhs (`Self`): The left hand side of the `and` statement.

**Returns:**

`lhs & self`.

### `__ror__`

`__ror__(self, lhs: Self) -> Self`

Returns `lhs | self`.

**Args:**

* ​lhs (`Self`): The left hand side of the `or` statement.

**Returns:**

`lhs | self`.

### `__rxor__`

`__rxor__(self, lhs: Self) -> Self`

Returns `lhs ^ self`.

**Args:**

* ​lhs (`Self`): The left hand side of the `xor` statement.

**Returns:**

`lhs ^ self`.

### `__iand__`

`__iand__(mut self, rhs: Self)`

Computes `self & rhs` and stores the result in `self`.

**Args:**

* ​rhs (`Self`): The right hand side of the `and` statement.

### `__ixor__`

`__ixor__(mut self, rhs: Self)`

Computes `self ^ rhs` and stores the result in `self`.

**Args:**

* ​rhs (`Self`): The right hand side of the `xor` statement.

### `__ior__`

`__ior__(mut self, rhs: Self)`

Computes `self | rhs` and stores the result in `self`.

**Args:**

* ​rhs (`Self`): The right hand side of the `or` statement.

### `copy`

`copy(self) -> Self`

Explicitly construct a deep copy of the provided value.

**Returns:**

A copy of the value.

### `__as_bool__`

`__as_bool__(self) -> Self`

Convert to Bool.

**Returns:**

This value.

### `__str__`

`__str__(self) -> String`

Get the bool as a string. Returns `"True"` or `"False"`.

**Returns:**

A string representation.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this boolean to the provided Writer.

**Parameters:**

* ​W (`Writer`): A type conforming to the Writer trait.

**Args:**

* ​writer (`W`): The object to write to.

### `__repr__`

`__repr__(self) -> String`

Get the bool as a string. Returns `"True"` or `"False"`.

**Returns:**

A string representation.

### `__int__`

`__int__(self) -> Int`

Convert this Bool to an integer.

**Returns:**

1 if the Bool is True, 0 otherwise.

### `__as_int__`

`__as_int__(self) -> Int`

Implicitly convert to an integral representation of the value, wherever an `Int` is expected.

**Returns:**

The integral representation of the value.

### `__index__`

`__index__(self) -> index`

Convert to index.

**Returns:**

1 if the Bool is True, 0 otherwise.

### `__float__`

`__float__(self) -> SIMD[float64, 1]`

Convert this Bool to a float.
**Returns:**

1.0 if True, 0.0 otherwise.

### `__hash__`

`__hash__[H: _Hasher](self, mut hasher: H)`

Updates hasher with the underlying bytes.

**Parameters:**

* ​H (`_Hasher`): The hasher type.

**Args:**

* ​hasher (`H`): The hasher instance.

### `to_python_object`

`to_python_object(self) -> PythonObject`

Convert this value to a PythonObject.

**Returns:**

A PythonObject representing the value.

---

## Boolable

The `Boolable` trait describes a type that can be explicitly converted to a `Bool` or evaluated as a boolean expression in `if` or `while` conditions.

This trait requires the type to implement the `__bool__()` method. For example:

```mojo
struct Foo(Boolable):
    var val: Bool

    fn __bool__(self) -> Bool:
        return self.val
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__bool__`

`__bool__(self: _Self) -> Bool`

Get the boolean representation of the value.

**Returns:**

The boolean representation of the value.

---

## bottom_k_shape

`bottom_k_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], k: Int, axis: Int) -> IndexList[rank]`

---

## BoundingBox

`struct BoundingBox[type: DType]`

## Fields

* ​nw (`SIMD[type, 2]`):
* ​se (`SIMD[type, 2]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, y1: SIMD[type, 1], x1: SIMD[type, 1], y2: SIMD[type, 1], x2: SIMD[type, 1])`

### `iou`

`iou(self, other: Self) -> SIMD[type, 1]`

### `intersection_area`

`intersection_area(self, other: Self) -> SIMD[type, 1]`

### `area`

`area(self) -> SIMD[type, 1]`

---

## breakpoint

`breakpoint()`

Cause an execution trap with the intention of requesting the attention of a debugger.

---

## breakpoint

This module includes the builtin breakpoint function.

## Functions

* [​`breakpoint`](/mojo/stdlib/builtin/breakpoint/breakpoint): Cause an execution trap with the intention of requesting the attention of a debugger.

---

## breakpointhook

`breakpointhook()`

Cause an execution trap with the intention of requesting the attention of a debugger.

---

## Bring your own fine-tuned model to MAX pipelines

In the [MAX 24.4](https://www.modular.com/blog/whats-new-in-max-24-4-max-on-macos-fast-local-llama3-native-quantization-and-gguf-support) release, we introduced native support for quantization and the GGUF weight format. In this tutorial, we'll guide you through the steps to integrate your fine-tuned custom model into the MAX pipelines. More specifically, we will start with the initial configuration and then demonstrate how to download a model from the Hugging Face Hub. If the model is not already available in a supported quantized GGUF format, we'll show you how to convert it to prepare for ingestion into the MAX pipelines. Finally, we will explore how to use the quantized GGUF model via the MAX pipelines CLI.

## About model customization

Model customization in machine learning typically involves modifying a pre-trained model to better suit specific tasks or datasets. One effective approach is fine-tuning, where a model trained on a large dataset is further trained (or fine-tuned) on a smaller, task-specific dataset. In this tutorial, we focus on [Low Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685).
LoRA (and its quantized variant [QLoRA](https://arxiv.org/abs/2305.14314)) allows for efficient adaptation of large models by only updating a small set of additional parameters, preserving the original model's structure by integrating LoRA layers without altering the primary architecture.

For this tutorial, we assume the LoRA weights have been merged into the original model, such as **Llama 3.1**. This functionality is provided by major fine-tuning libraries, such as [unsloth `save_pretrained_merged`](https://docs.unsloth.ai/basics/saving-models/saving-to-gguf) or the [PEFT model merging](https://huggingface.co/docs/peft/en/developer_guides/model_merging) APIs.

## Step 1: Set up Hugging Face access

To interact with models hosted on Hugging Face, secure access is required either via SSH or an access token. Follow the instructions in the [Hugging Face documentation](https://huggingface.co/docs/hub/en/security-git-ssh) to set up SSH. We can verify our configuration by running:

```sh
ssh -T git@hf.co
```

A successful setup will display `Hi <username>, welcome to Hugging Face`.

## Step 2: Set up MAX pipelines

Next, clone the [MAX GitHub repository](https://github.com/modular/modular) and navigate to the MAX pipelines directory:

```sh
git clone -b stable https://github.com/modular/modular && cd modular
cd src/max
```

## Step 3: Include the `huggingface_hub` CLI

We'll use the `magic` CLI to create a virtual environment and install the required packages. Now install the `huggingface_hub` library to enable interactions with the Hugging Face Hub. This package facilitates the download and management of models and datasets:

```sh
magic add --pypi huggingface_hub hf_transfer
```

With the Hugging Face Hub CLI installed, we can proceed to the next steps of downloading and converting our model.

## Step 4: Convert to GGUF format

If your model is already in the [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md), you can skip this conversion step and proceed directly to the next step. If not, here are the most common methods to convert a model to a quantized GGUF format suitable for deployment:

- **Automated conversion via Hugging Face space**: We can use the [gguf-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space for a streamlined conversion process to convert to a supported quantized GGUF format. Remember to log in. For this tutorial, we choose the `Q4_K_M` quantization method. You can see all the supported quantization encodings in the [`encodings` module](/max/api/mojo/graph/quantization/encodings). For demonstration, we will choose [mlabonne/FineLlama-3.1-8B](https://huggingface.co/mlabonne/FineLlama-3.1-8B). After conversion, the model will be available under your Hugging Face username, ready for download and deployment.

![](images/max-pipeline-bring-your-own-model/gguf-my-repo.png)

The following will download the converted GGUF model:

```sh
HF_HUB_ENABLE_HF_TRANSFER=1 magic run huggingface-cli download \
  <USERNAME>/FineLlama-3.1-8B-Q4_K_M-GGUF \
  --repo-type model \
  --local-dir ./models
```

- **Manually convert via llama.cpp script**: Alternatively, utilize the [llama.cpp converter script](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py) to manually convert your model.

```sh
git clone https://github.com/ggerganov/llama.cpp

# If your model is available on the Hugging Face Hub,
# download it with the following command. Ensure you replace
# <REPO_ID> with the appropriate repository or model ID from
# Hugging Face. Otherwise, skip this command.
HF_HUB_ENABLE_HF_TRANSFER=1 magic run huggingface-cli download <REPO_ID> \
  --repo-type model \
  --local-dir ./models

python llama.cpp/convert_hf_to_gguf.py models
```

With all the requirements in place, we are now ready to use our custom model in MAX pipelines.

## Step 5: Run the custom model

With our fine-tuned Llama 3.1 model successfully converted to GGUF format, we're ready to put it into action using MAX pipelines. For this demonstration, we'll be using our converted model file `finellama-3.1-8b-q4_k_m.gguf`.

First, let's install the necessary CLI tool. MAX provides the `max` package, which we can easily install using the `magic` command:

```bash
magic global install max
```

Before running our model, it's worth noting that MAX pipelines offer various configuration options. You can explore these by running `max --help`.

:::note
If you use private or gated models, you must set your [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) first. For example:

```bash
export HF_TOKEN="hf_..."
```

Then you can run a MAX Pipelines command for a private or gated model.
:::

Now, let's run our custom model. We'll use the `max generate` command, specifying our model configuration and a test prompt:

```bash
max generate \
  --model-path=modularai/Llama-3.1-8B-Instruct-GGUF \
  --quantization-encoding "q4_k" \
  --weight-path "./models/finellama-3.1-8b-q4_k_m.gguf" \
  --prompt "What is the meaning of life?"
```

It generates the following answer:

```output
The meaning of life is a question that has been pondered by philosophers, scientists, and spiritual leaders for centuries. It is a question that has no definitive answer, as it is deeply personal and subjective to each individual. However, many have attempted to provide their own interpretations or explanations.

One interpretation of the meaning of life is that it is simply to live and experience the world around us. This view suggests that the purpose of life is to experience all that it has to offer, whether it be through the senses, emotions, or intellectual pursuits. In this sense, the meaning of life is not necessarily tied to any specific goal or achievement, but rather to the process of living itself.

Another interpretation is that the meaning of life is to find purpose and meaning in our lives. This view suggests that we are here to seek out our own unique purpose and to strive to achieve it. This can be achieved through various means, such as through our work, relationships, or personal pursuits.

A third interpretation is that the meaning of life is to connect with something larger than ourselves. This view suggests that we are here to connect with a higher power, whether it be through religion, spirituality, or a sense of awe and wonder at the universe. In this sense, the meaning of life is to find a sense of purpose and connection that transcends our individual lives.

Ultimately, the meaning of life is a question that each person must answer for themselves. It is a question that requires us to reflect on our own values, beliefs, and experiences. As the saying goes, "Ask a flower" - the meaning of life is not something that can be answered in words, but rather in the experience of living itself.
```

## Next steps

Congratulations on successfully integrating your fine-tuned Llama 3.1 model into the MAX pipelines! 🎉

We have navigated through setting up secure access, downloading and converting models, and finally running your custom model in MAX pipelines.
We encourage you to further customize your models via the MAX Graph API, test your pipeline, and explore other MAX features, including how to **deploy your fine-tuned model on GPU using MAX Serve**.

Here are some other topics to explore next:

* [Get started with MAX Graph in Python](/max/tutorials/get-started-with-max-graph-in-python)
* [Deploy Llama 3 on GPU with MAX](/max/tutorials/max-serve-local-to-cloud)

---

## broadcast

`broadcast[type: DType](output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

For each axis of `input`, if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`.

**Args:**

* ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer.
* ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer.

---

## broadcast

## Functions

* [​`broadcast`](./broadcast): For each axis of `input`, if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`.
* [​`broadcast_impl`](./broadcast_impl): For each axis of `input` ∈ \[axis, rank), if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`.

---

## broadcast

`broadcast[type: DType, width: Int, //, *, block_size: Int](val: SIMD[type, width], src_thread: UInt = UInt(0)) -> SIMD[type, width]`

Broadcasts a value from a source thread to all threads in a block.

This function takes a SIMD value from the specified source thread and copies it to all other threads in the block, effectively broadcasting the value across the entire block.

**Parameters:**

* ​type (`DType`): The data type of the SIMD elements.
* ​width (`Int`): The number of elements in each SIMD vector.
* ​block\_size (`Int`): The total number of threads in the block.

**Args:**

* ​val (`SIMD[type, width]`): The SIMD value to broadcast from the source thread.
* ​src\_thread (`UInt`): The thread ID of the source thread (default: 0).

**Returns:**

A SIMD value where all threads contain a copy of the input value from the source thread.

---

## broadcast

`broadcast[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]`

Broadcasts a SIMD value from lane 0 to all lanes in the warp.

This function takes a SIMD value from lane 0 and copies it to all other lanes in the warp, effectively broadcasting the value across the entire warp. This is useful for sharing data between threads in a warp without using shared memory.

**Parameters:**

* ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32).
* ​simd\_width (`Int`): The number of elements in the SIMD vector.

**Args:**

* ​val (`SIMD[val_type, simd_width]`): The SIMD value to broadcast from lane 0.

**Returns:**

A SIMD value where all lanes contain a copy of the input value from lane 0.
`broadcast(val: Int) -> Int` Broadcasts an integer value from lane 0 to all lanes in the warp. This function takes an integer value from lane 0 and copies it to all other lanes in the warp. It provides a convenient way to share scalar integer data between threads without using shared memory. **Args:** * ​val (`Int`): The integer value to broadcast from lane 0. **Returns:** The broadcast integer value, where all lanes receive a copy of the input from lane 0. `broadcast(val: UInt) -> UInt` Broadcasts an unsigned integer value from lane 0 to all lanes in the warp. This function takes an unsigned integer value from lane 0 and copies it to all other lanes in the warp. It provides a convenient way to share scalar unsigned integer data between threads without using shared memory. **Args:** * ​val (`UInt`): The unsigned integer value to broadcast from lane 0. **Returns:** The broadcast unsigned integer value, where all lanes receive a copy of the input from lane 0. --- ## broadcast_impl `broadcast_impl[type: DType](axis: Int, output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input_prev_axis_stride: Int, output_prev_axis_stride: Int, input_offset: Int, output_offset: Int, rightmost_broadcast_axis: Int)` For each axis of `input` ∈ \[axis, rank), if the dimension is 1, duplicate the data at each index of the corresponding axis in `output`, otherwise copy over the entire axis to the corresponding axis in `output`. **Args:** * ​axis (`Int`): The axis value. * ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. * ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input buffer. * ​input\_prev\_axis\_stride (`Int`): The stride at axis `axis - 1` for input. * ​output\_prev\_axis\_stride (`Int`): The stride at axis `axis - 1` for output. * ​input\_offset (`Int`): The offset at which we start copying data from. * ​output\_offset (`Int`): The offset at which we start copying data to. * ​rightmost\_broadcast\_axis (`Int`): The largest axis at which we need to duplicate `input` data. --- ## BTileGenerator `struct BTileGenerator[mut: Bool, //, config: KernelConfig, a_type: DType, b_type: DType, c_type: DType, shape: DimList, transpose_b: Bool, b_packed: Bool, origin: Origin[mut]]` Struct to encapsulate a tile of B that supports prepacking. If b\_packed is true, calls to get\_tile will return a buffer view from B. Otherwise, calls to get\_tile will copy a tile from B into a stack allocated scratch buffer and return a view of that. 
## Fields

* ​b (`NDBuffer[b_type, 2, origin, shape]`):
* ​b\_tile\_stack\_ptr (`UnsafePointer[SIMD[b_type, 1]]`):
* ​tile\_n\_k (`IndexList[2]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `get`

`static get(b: NDBuffer[b_type, 2, origin, shape], tile_n_k: IndexList[2]) -> Self`

### `get_tile`

`get_tile[inner_size: Int](self, global_offset: GemmShape, tile_dim_nk: IndexList[2], valid_data_dim_nk: IndexList[2]) -> NDBuffer[b_type, 3, MutableAnyOrigin, config.packed_shape]`

Get a packed matrix (B) tile. valid\_data\_dim\_nk is ignored for pre-packing, where the tile is padded to have the shape of tile\_dim\_nk.

**Args:**

* ​global\_offset (`GemmShape`): Offset in the global M, N, K dimensions.
* ​tile\_dim\_nk (`IndexList[2]`): Tile shape based on cache size and matrix dimensions.
* ​valid\_data\_dim\_nk (`IndexList[2]`): The upper bounds for N and K dimensions.

**Returns:**

A view of the packed tile.

---

## buffer

Implements the NDBuffer struct.

You can import these APIs from the `buffer` package. For example:

```mojo
from buffer import NDBuffer
```

## Structs

* [​`NDBuffer`](/mojo/stdlib/buffer/buffer/NDBuffer): An N-dimensional buffer.

## Functions

* [​`partial_simd_load`](/mojo/stdlib/buffer/buffer/partial_simd_load): Loads a vector with dynamic bound.
* [​`partial_simd_store`](/mojo/stdlib/buffer/buffer/partial_simd_store): Stores a vector with dynamic bound.
* [​`prod_dims`](/mojo/stdlib/buffer/buffer/prod_dims): Computes the product of a slice of the given buffer's dimensions.

---

## buffer

Implements the buffer package.

## Modules

* [​`buffer`](/mojo/stdlib/buffer/buffer/): Implements the NDBuffer struct.
* [​`dimlist`](/mojo/stdlib/buffer/dimlist/): Provides utilities for working with static and variadic lists.

---

## buffer_load

`buffer_load[type: DType, width: Int](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1]) -> SIMD[type, width]`

Loads data from global memory into a SIMD register.

This function provides a hardware-accelerated global memory load operation that maps directly to the AMDGPU buffer\_load instruction. It efficiently transfers data from global memory to registers.

Note:

* Only supported on AMD GPUs.
* Uses non-glc loads by default (can hit L1 cache and persist across wavefronts).
* Supports widths that map to 1, 2, 4, 8, or 16 byte loads.
* Maps directly to llvm.amdgcn.raw.buffer.load intrinsics.

**Parameters:**

* ​type (`DType`): The data type to load.
* ​width (`Int`): The SIMD vector width for vectorized loads.

**Args:**

* ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor created by make\_buffer\_resource().
* ​gds\_offset (`SIMD[int32, 1]`): Offset in elements (not bytes) from the base address in the resource.

**Returns:**

SIMD vector containing the loaded data.

---

## buffer_load_store_lds

`buffer_load_store_lds[type: DType](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1], lds_ptr_base: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)], lds_offset: SIMD[int32, 1])`

Loads four bytes from global memory and writes them to shared memory.

Copies from global memory to shared memory (aka LDS), bypassing storing to register.

**Parameters:**

* ​type (`DType`): The type of the data to be loaded.

**Args:**

* ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor from make\_buffer\_resource.
* ​gds\_offset (`SIMD[int32, 1]`): Global memory offset.
* ​lds\_ptr\_base (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(3)]`): LDS base address.
* ​lds\_offset (`SIMD[int32, 1]`): LDS offset.

---

## buffer_store

`buffer_store[type: DType, width: Int](src_resource: SIMD[uint32, 4], gds_offset: SIMD[int32, 1], val: SIMD[type, width])`

Stores a register variable to global memory.

Writes to global memory from a register.

**Parameters:**

* ​type (`DType`): The data type.
* ​width (`Int`): The SIMD vector width.

**Args:**

* ​src\_resource (`SIMD[uint32, 4]`): Buffer resource descriptor.
* ​gds\_offset (`SIMD[int32, 1]`): Global memory offset.
* ​val (`SIMD[type, width]`): Value to write.

---

## BufferValue

## `BufferValue` {#max.graph.BufferValue}

> *class* max.graph.BufferValue(value)

Bases: [`Value`](Value.md#max.graph.Value)\[`BufferType`]

Represents a mutable semantic tensor within a Graph.

Value is abstract, it shouldn't be constructed directly.

**Parameters:**

**value** ([`Value`](Value.md#max.graph.Value) `| _Value[mo.BufferType]`)

### `device` {#max.graph.BufferValue.device}

> *property* device: DeviceRef

Returns the device of the BufferValue.

### `dtype` {#max.graph.BufferValue.dtype}

> *property* dtype: [DType](../dtype.md#max.dtype.DType)

Returns the tensor data type.

### `print()` {#max.graph.BufferValue.print}

> print(label='debug\_buffer')

**Parameters:**

**label** ([`str`](https://docs.python.org/3/library/stdtypes.html#str))

### `rank` {#max.graph.BufferValue.rank}

> *property* rank: [int](https://docs.python.org/3/library/functions.html#int)

Returns the rank (number of dims) of the buffer.

### `shape` {#max.graph.BufferValue.shape}

> *property* shape: [Shape](type.md#max.graph.type.Shape)

Returns the shape of the BufferValue.

### `type` {#max.graph.BufferValue.type}

> *property* type: BufferType

Returns the type of the [`BufferValue`](#max.graph.BufferValue) as a `BufferType`.

---

## Build custom ops for GPUs

[Mojo](/mojo/manual/index.md) is our not-so-secret weapon for achieving architecture-independent performance for all types of AI workloads. Previously, only Modular engineers were able to write high-performance parallel processing operations for a [MAX Graph](/max/model-formats.mdx#max-graph) using Mojo. In this tutorial, you'll learn how to write custom operations (custom ops) for MAX graphs using Mojo that can execute efficiently on both CPUs and GPUs. You'll execute a graph with a custom operation and learn to create a matrix addition operation that adds one to each matrix element.

To help you get started, we provide several [Custom Operations recipes](https://github.com/modular/max-recipes/tree/main/custom-ops-introduction) that you can run with the nightly version of MAX.

## Set up your environment

Using a virtual environment ensures that you have the MAX and Mojo version that's compatible with this project. We'll use the [Magic CLI](/magic) to create the environment and install the required packages.

1. Install the [Magic CLI](/magic), if you haven't already.

2. Create a new project with the `custom-ops-introduction` recipe:

   ```sh
   magic init max-custom-ops --from modular/max-recipes/custom-ops-introduction && \
   cd max-custom-ops
   ```

3. You can run the custom addition operation example like this:

   ```sh
   magic run add_one
   ```

   And the following is the expected output:

   ```output
   Graph result:
   [[1.7736697 1.4688652 1.7971799 1.4553597 1.8967733 1.3691401 1.1297637 1.7047229 1.1314526 1.3924606]
   # ... shortened for brevity

   Expected result:
   [[1.7736697 1.4688652 1.7971799 1.4553597 1.8967733 1.3691401 1.1297637 1.7047229 1.1314526 1.3924606]
   # ... shortened for brevity
   ```

The exact output will vary based on random initialization of the input tensor, but the graph result and expected result should be the same.

Now that you've seen the code in action, let's dive into the implementation details to understand how this custom addition operation works under the hood.

## Define a Mojo custom operation

The MAX Graph API represents models as computational graphs, where each operation describes parallel computations that the MAX Engine optimizes for hardware performance. Within these graphs, nodes can process any number of input tensors, perform computations on the target hardware, and generate one or more output tensors as results.

To illustrate this, open the `add_custom.mojo` file in the [kernels](https://github.com/modular/modular/tree/main/examples/custom_ops/kernels) directory. Here, a custom operation called `AddOne` takes an input tensor, adds one to every element, and returns the result of that computation as a new tensor. This custom compute node is defined as a Mojo struct:

```mojo
import compiler
from tensor import OutputTensor, InputTensor, foreach
from runtime.asyncrt import DeviceContextPtr
from utils.index import IndexList

@compiler.register("add_one")
struct AddOne:
    @staticmethod
    fn execute[
        target: StaticString,
    ](
        out: OutputTensor,
        x: InputTensor[type = out.type, rank = out.rank],
        ctx: DeviceContextPtr,
    ) raises:
```

The [`@compiler.register()`](/max/api/mojo-decorators/compiler-register) decorator registers the custom operation with the name `add_one`. Mojo's [Single Instruction Multiple Data (SIMD)](/mojo/stdlib/builtin/simd/SIMD.md) types and compile-time parameters enable hardware-agnostic parallel processing.

Inputs and outputs take the form of `InputTensor` and `OutputTensor`, respectively. These are both specialized versions of the [`ManagedTensorSlice`](/max/api/mojo/tensor/managed_tensor_slice/ManagedTensorSlice) type, which represents a tensor of a specific rank and datatype whose memory is managed outside of the operation. Elements are read from the input tensors and written directly into the output tensors. Any output tensors must come first in the operation signature.

The core computation, adding one to each element in the tensor, happens in the nested `elementwise_add_one()` function:

```mojo
@parameter
@always_inline
fn elementwise_add_one[
    width: Int
](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
    return x.load[width](idx) + 1

foreach[elementwise_add_one, target=target](out, ctx)
```

The [`foreach()`](/max/api/mojo/tensor/managed_tensor_slice/foreach/) function distributes an elementwise computation in parallel across all elements in the output tensor. This method is optimized for specific hardware platforms, distributing parallel workloads to make the most efficient use of computational resources.

A library of these custom operations can be defined in Mojo files (`.mojo`) and used directly by the graph compiler when defining a MAX Graph.

## Add the custom operation to a graph

The MAX Graph API contains [a series of pre-defined operations](/max/api/mojo/graph/ops/index.md) written by Modular that have highly optimized implementations. In addition to those APIs, the [`custom()`](/max/api/python/graph/ops#max.graph.ops.custom) function allows you to specify custom user-defined Mojo operations.
To use a Mojo custom operation with GPU acceleration, specify the custom ops in your MAX graph. The [`add_one.py`](https://github.com/modular/max-recipes/blob/main/custom-ops-introduction/add_one.py) example demonstrates building a computational graph in Python:

```python
import os
from pathlib import Path

import numpy as np
from max.driver import CPU, Accelerator, Tensor, accelerator_count
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import Graph, TensorType, ops

if __name__ == "__main__":
    path = Path(__file__).parent / "operations.mojopkg"

    rows = 5
    columns = 10
    dtype = DType.float32

    # Configure our simple one-operation graph.
    graph = Graph(
        "addition",
        forward=lambda x: ops.custom(
            name="add_one",
            values=[x],
            out_types=[TensorType(dtype=x.dtype, shape=x.tensor.shape)],
        )[0].tensor,
        input_types=[
            TensorType(dtype, shape=[rows, columns]),
        ],
    )
```

The [`Graph()`](/max/api/python/graph/Graph.md) takes an input tensor with five rows and ten columns, runs the custom `add_one` operation on it, and returns the result. The custom operation is specified using the `ops.custom()` function, which requires the operation name, input values, and output tensor types.

Because MAX works across a range of hardware architectures, this same code can be run on a GPU if it is available, or a local CPU if not. For example:

```python
device = CPU() if accelerator_count() == 0 else Accelerator()
```

Using the `InferenceSession()` class, this graph is placed on whatever device we've selected:

```python
session = InferenceSession(
    devices=[device],
    custom_extensions=path,
)
```

This configures the inference session to run on the detected compute type, after which MAX Engine can compile the graph to optimize for the target hardware:

```python
model = session.load(graph)
```

Memory management between host CPUs and accelerator devices is handled through the MAX Driver API. This interface gives you precise control over memory transfers, allowing you to optimize performance by explicitly managing these potentially expensive operations. The API's [`Tensor`](/max/api/python/driver/#tensor-1) class is designed for seamless integration with common Python frameworks: it offers zero-copy interoperability with both NumPy arrays and PyTorch tensors. Here's how we can leverage this to create a MAX Tensor from random data:

```python
x_array = np.random.uniform(size=(rows, columns)).astype(np.float32)
x = Tensor.from_numpy(x_array)
```

This Tensor is resident on the host and needs to be moved to the accelerator to be ready for use with the MAX Graph on that device. Note that if the device is the host CPU, this is a no-op:

```python
x = x.to(device)
```

This Tensor can now be run through our compiled graph, and a device-resident tensor is the result:

```python
result = model.execute(x)[0]
```

To examine the results, this Tensor can be moved back to the host:

```python
result = result.to(CPU())
```

Then you can convert it back to a NumPy array:

```python
print(result.to_numpy())
```

For a more advanced example, be sure to check out how we compute the [Mandelbrot set](https://github.com/modular/modular/tree/main/examples/custom_ops) using the [`ComplexSIMD`](/mojo/stdlib/complex/complex/ComplexSIMD.md) data type and a vectorized implementation of the fractal computation.

As a final note, the programming interface described above is being provided as a preview, and some elements will change as we continue to improve [GPU programming with Mojo](/mojo/manual/gpu/basics).
## More to come

Mojo is an incredible language for programming accelerators: Python-like high-level syntax, systems language performance, and unique language features designed for modern heterogeneous computation. We're tremendously excited to be able to show off how it enables MAX to drive forward the state-of-the-art when running AI workloads and more on GPUs. Adding custom ops to a graph is our first introduction to how you can program GPUs with Mojo. These are early examples, and we will be rolling out more API documentation and examples. To stay up to date with new releases, [sign up for our newsletter](https://www.modular.com/modverse#signup), [check out the community](https://www.modular.com/community), and [join our forum](https://forum.modular.com/).

The nightly branch of the open-source MAX repository contains everything needed to run the examples above on an Ampere- or Lovelace-class NVIDIA GPU (more to come!), as well as on a local CPU. Give them a try today to start experimenting with programming GPUs in Mojo!

## Next steps

Related tutorials: `get-started-with-max-graph-in-python`, `magic`, and `get-started`.

---

## builtin

Implements the builtin package.

## Modules

* [`anytype`](/mojo/stdlib/builtin/anytype/): Defines the core traits for object lifetime management in Mojo.
* [`bool`](/mojo/stdlib/builtin/bool/): Implements the Bool class.
* [`breakpoint`](/mojo/stdlib/builtin/breakpoint/): This module includes the builtin breakpoint function.
* [`builtin_slice`](/mojo/stdlib/builtin/builtin_slice/): Implements slice.
* [`comparable`](/mojo/stdlib/builtin/comparable/):
* [`constrained`](/mojo/stdlib/builtin/constrained/): Implements compile-time constraints.
* [`coroutine`](/mojo/stdlib/builtin/coroutine/): Implements classes and methods for coroutines.
* [`debug_assert`](/mojo/stdlib/builtin/debug_assert/): Implements run-time assertions.
* [`device_passable`](/mojo/stdlib/builtin/device_passable/):
* [`dtype`](/mojo/stdlib/builtin/dtype/): Implements the DType class.
* [`equality_comparable`](/mojo/stdlib/builtin/equality_comparable/):
* [`error`](/mojo/stdlib/builtin/error/): Implements the Error class.
* [`file`](/mojo/stdlib/builtin/file/): Provides APIs to read and write files.
* [`file_descriptor`](/mojo/stdlib/builtin/file_descriptor/): Higher level abstraction for file stream.
* [`float_literal`](/mojo/stdlib/builtin/float_literal/): Implements the FloatLiteral class.
* [`floatable`](/mojo/stdlib/builtin/floatable/): Implements the `Floatable` and `FloatableRaising` traits.
* [`format_int`](/mojo/stdlib/builtin/format_int/): Provides the `hex` and `bin` functions.
* [`identifiable`](/mojo/stdlib/builtin/identifiable/):
* [`int`](/mojo/stdlib/builtin/int/): Implements the Int class.
* [`int_literal`](/mojo/stdlib/builtin/int_literal/): Implements the IntLiteral class.
* [`io`](/mojo/stdlib/builtin/io/): Provides utilities for working with input/output.
* [`len`](/mojo/stdlib/builtin/len/): Provides the `len()` function and its associated traits.
* [`math`](/mojo/stdlib/builtin/math/): Defines basic math functions for use in the open source parts of the standard library, since the closed-source `math` package cannot be depended on there.
* [`none`](/mojo/stdlib/builtin/none/): Defines the builtin `NoneType`.
* [`range`](/mojo/stdlib/builtin/range/): Implements a 'range' call.
* [`rebind`](/mojo/stdlib/builtin/rebind/): Implements type rebind.
* [`repr`](/mojo/stdlib/builtin/repr/): Provides the `repr` function.
* [`reversed`](/mojo/stdlib/builtin/reversed/): Provides the `reversed` function for reverse iteration over collections.
* [`simd`](/mojo/stdlib/builtin/simd/): Implements SIMD primitives and abstractions.
* [`sort`](/mojo/stdlib/builtin/sort/): Implements the built-in `sort` function.
* [`str`](/mojo/stdlib/builtin/str/): Provides the `str` function.
* [`string_literal`](/mojo/stdlib/builtin/string_literal/): Implements the StringLiteral struct.
* [`swap`](/mojo/stdlib/builtin/swap/): Implements the built-in `swap` function.
* [`tuple`](/mojo/stdlib/builtin/tuple/): Implements the Tuple type.
* [`type_aliases`](/mojo/stdlib/builtin/type_aliases/): Defines some type aliases.
* [`uint`](/mojo/stdlib/builtin/uint/): Implements the UInt class.
* [`value`](/mojo/stdlib/builtin/value/): Defines core value traits.
* [`variadics`](/mojo/stdlib/builtin/variadics/): Implements the VariadicList and VariadicPack types.

---

## builtin_slice

Implements slice. These are Mojo built-ins, so you don't need to import them.

## Structs

* [`Slice`](/mojo/stdlib/builtin/builtin_slice/Slice): Represents a slice expression.

## Functions

* [`slice`](/mojo/stdlib/builtin/builtin_slice/slice-function): Construct slice given the end value.

---

## byte_permute

`byte_permute(a: SIMD[uint32, 1], b: SIMD[uint32, 1], c: SIMD[uint32, 1]) -> SIMD[uint32, 1]`

Permutes bytes from two 32-bit integers based on a control mask. Selects and rearranges bytes from two source integers based on a control mask to create a new 32-bit value.

Note: Byte selection behavior depends on the GPU architecture:

* On NVIDIA: Maps to the PRMT instruction.
* On AMD: Maps to the PERM instruction.

**Args:**

* a (`SIMD[uint32, 1]`): First source integer containing bytes to select from.
* b (`SIMD[uint32, 1]`): Second source integer containing bytes to select from.
* c (`SIMD[uint32, 1]`): Control mask that specifies which bytes to select and their positions. Each byte in the mask controls selection/placement of one output byte.

**Returns:**

A new 32-bit integer containing the selected and rearranged bytes.

---

## byte_swap

`byte_swap(val: Int) -> Int`

Byte-swaps an integer value with an even number of bytes. Byte-swapping an 8-byte `Int` reverses the order of its bytes: if the input bytes are numbered 0, 1, 2, 3, 4, 5, 6, 7, the returned integer has its bytes in 7, 6, 5, 4, 3, 2, 1, 0 order.

**Args:**

* val (`Int`): The input value.

**Returns:**

The input value with its bytes swapped.

`byte_swap[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]`

Byte-swaps a SIMD vector of integer values, where each element has an even number of bytes (a positive multiple of 16 bits). For example, byte-swapping an `Int16` returns an `Int16` with the high and low bytes of the input swapped. Similarly, byte-swapping an `Int32` returns an `Int32` with the four bytes of the input reversed, so that if the input bytes are numbered 0, 1, 2, 3, the returned `Int32` has its bytes in 3, 2, 1, 0 order. `Int64` and other integer types extend this concept to additional even-byte lengths (6 bytes, 8 bytes, and so on).

**Constraints:**

The element type of the input vector must be an integral type.
**Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the value of the element at position `i` of the input value with its bytes swapped. --- ## C API You can use the following C APIs to run inference with MAX Engine. ## API headers Each of the following pages represents one of the C API header files: * [Common](common.md) * [`M_version()`](common.md#_CPPv49M_versionv) * [`M_newStatus()`](common.md#_CPPv411M_newStatusv) * [`M_getError()`](common.md#_CPPv410M_getErrorPK8M_Status) * [`M_isError()`](common.md#_CPPv49M_isErrorPK8M_Status) * [`M_freeStatus()`](common.md#_CPPv412M_freeStatusP8M_Status) * [`M_sizeOf()`](common.md#_CPPv48M_sizeOf7M_Dtype) * [`M_getDynamicDimensionValue()`](common.md#_CPPv426M_getDynamicDimensionValuev) * [`M_getDynamicRankValue()`](common.md#_CPPv421M_getDynamicRankValuev) * [Context](context.md) * [`M_newRuntimeConfig()`](context.md#_CPPv418M_newRuntimeConfigv) * [`M_setNumThreads()`](context.md#_CPPv415M_setNumThreadsP15M_RuntimeConfig6size_t) * [`M_setAllocatorType()`](context.md#_CPPv418M_setAllocatorTypeP15M_RuntimeConfig15M_AllocatorType) * [`M_setCPUAffinity()`](context.md#_CPPv416M_setCPUAffinityP15M_RuntimeConfigb) * [`M_getNumThreads()`](context.md#_CPPv415M_getNumThreadsP15M_RuntimeConfig) * [`M_getCPUAffinity()`](context.md#_CPPv416M_getCPUAffinityP15M_RuntimeConfig) * [`M_enableCrashLog()`](context.md#_CPPv416M_enableCrashLogP15M_RuntimeConfigPKc) * [`M_freeRuntimeConfig()`](context.md#_CPPv419M_freeRuntimeConfigP15M_RuntimeConfig) * [`M_newRuntimeContext()`](context.md#_CPPv419M_newRuntimeContextPK15M_RuntimeConfigP8M_Status) * [`M_freeRuntimeContext()`](context.md#_CPPv420M_freeRuntimeContextP16M_RuntimeContext) * [`M_setDebugPrintOptions()`](context.md#_CPPv422M_setDebugPrintOptionsP16M_RuntimeContext19M_ResultOutputStylejPKc) * [`M_setMojoDefineBool()`](context.md#_CPPv419M_setMojoDefineBoolP16M_RuntimeContextPKcb) * [`M_setMojoDefineInt()`](context.md#_CPPv418M_setMojoDefineIntP16M_RuntimeContextPKci) * [`M_setMojoDefineString()`](context.md#_CPPv421M_setMojoDefineStringP16M_RuntimeContextPKcPKc) * [Model](model.md) * [`M_newCompileConfig()`](model.md#_CPPv418M_newCompileConfigv) * [`M_cloneCompileConfig()`](model.md#_CPPv420M_cloneCompileConfigP15M_CompileConfig) * [`M_setModelPath()`](model.md#_CPPv414M_setModelPathP15M_CompileConfigPKc) * [`M_newModelSource()`](model.md#_CPPv416M_newModelSourcePv17M_FrameworkFormat) * [`M_setModelSource()`](model.md#_CPPv416M_setModelSourceP15M_CompileConfigP13M_ModelSource) * [`M_compileModel()`](model.md#_CPPv414M_compileModelPK16M_RuntimeContextPP15M_CompileConfigP8M_Status) * [`M_waitForCompilation()`](model.md#_CPPv420M_waitForCompilationP20M_AsyncCompiledModelP8M_Status) * [`M_compileModelSync()`](model.md#_CPPv418M_compileModelSyncPK16M_RuntimeContextPP15M_CompileConfigP8M_Status) * [`M_initModel()`](model.md#_CPPv411M_initModelPK16M_RuntimeContextPK20M_AsyncCompiledModelPK17M_WeightsRegistryP8M_Status) * [`M_getInputNames()`](model.md#_CPPv415M_getInputNamesPK20M_AsyncCompiledModelP8M_Status) * [`M_getOutputNames()`](model.md#_CPPv416M_getOutputNamesPK20M_AsyncCompiledModelP8M_Status) * [`M_getTensorNameAt()`](model.md#_CPPv417M_getTensorNameAtPK17M_TensorNameArray6size_t) * [`M_getModelInputSpecByName()`](model.md#_CPPv425M_getModelInputSpecByNamePK20M_AsyncCompiledModelPKcP8M_Status) * 
[`M_getModelOutputSpecByName()`](model.md#_CPPv426M_getModelOutputSpecByNamePK20M_AsyncCompiledModelPKcP8M_Status) * [`M_waitForModel()`](model.md#_CPPv414M_waitForModelP12M_AsyncModelP8M_Status) * [`M_executeModelSync()`](model.md#_CPPv418M_executeModelSyncPK16M_RuntimeContextP12M_AsyncModelP16M_AsyncTensorMapP8M_Status) * [`M_getNumModelInputs()`](model.md#_CPPv419M_getNumModelInputsPK20M_AsyncCompiledModelP8M_Status) * [`M_getNumModelOutputs()`](model.md#_CPPv420M_getNumModelOutputsPK20M_AsyncCompiledModelP8M_Status) * [`M_validateInputTensorSpec()`](model.md#_CPPv425M_validateInputTensorSpecPK20M_AsyncCompiledModelP16M_AsyncTensorMapP8M_Status) * [`M_freeModel()`](model.md#_CPPv411M_freeModelP12M_AsyncModel) * [`M_freeCompiledModel()`](model.md#_CPPv419M_freeCompiledModelP20M_AsyncCompiledModel) * [`M_freeCompileConfig()`](model.md#_CPPv419M_freeCompileConfigP15M_CompileConfig) * [`M_freeModelSource()`](model.md#_CPPv417M_freeModelSourceP13M_ModelSource) * [`M_exportCompiledModel()`](model.md#_CPPv421M_exportCompiledModelP20M_AsyncCompiledModelPKcP8M_Status) * [Tensor](tensor.md) * [`M_newTensorSpec()`](tensor.md#_CPPv415M_newTensorSpecPK7int64_t7int64_t7M_DtypePKc) * [`M_isDynamicRanked()`](tensor.md#_CPPv417M_isDynamicRankedPK12M_TensorSpec) * [`M_getDimAt()`](tensor.md#_CPPv410M_getDimAtPK12M_TensorSpec6size_t) * [`M_getRank()`](tensor.md#_CPPv49M_getRankPK12M_TensorSpec) * [`M_getDtype()`](tensor.md#_CPPv410M_getDtypePK12M_TensorSpec) * [`M_getName()`](tensor.md#_CPPv49M_getNameP12M_TensorSpec) * [`M_newAsyncTensorMap()`](tensor.md#_CPPv419M_newAsyncTensorMapPK16M_RuntimeContext) * [`M_copyAsyncTensorMap()`](tensor.md#_CPPv420M_copyAsyncTensorMapPK16M_AsyncTensorMap) * [`M_getTensorMapSize()`](tensor.md#_CPPv418M_getTensorMapSizePK16M_AsyncTensorMapP8M_Status) * [`M_borrowTensorInto()`](tensor.md#_CPPv418M_borrowTensorIntoP16M_AsyncTensorMapPKvPK12M_TensorSpecP8M_Status) * [`M_createBorrowedTensor()`](tensor.md#_CPPv422M_createBorrowedTensorPKvPK12M_TensorSpecP16M_RuntimeContext) * [`M_getTensorByNameFrom()`](tensor.md#_CPPv421M_getTensorByNameFromP16M_AsyncTensorMapPKcP8M_Status) * [`M_tensorMapKeys()`](tensor.md#_CPPv415M_tensorMapKeysP16M_AsyncTensorMapP7int64_t) * [`M_deleteTensorMapKeys()`](tensor.md#_CPPv421M_deleteTensorMapKeysPPKc) * [`M_getTensorFromValue()`](tensor.md#_CPPv420M_getTensorFromValueP12M_AsyncValue) * [`M_getTensorNumElements()`](tensor.md#_CPPv422M_getTensorNumElementsPK13M_AsyncTensor) * [`M_getTensorType()`](tensor.md#_CPPv415M_getTensorTypePK13M_AsyncTensor) * [`M_getTensorData()`](tensor.md#_CPPv415M_getTensorDataPK13M_AsyncTensor) * [`M_getTensorSpec()`](tensor.md#_CPPv415M_getTensorSpecPK13M_AsyncTensor) * [`M_getTensorMapIterator()`](tensor.md#_CPPv422M_getTensorMapIteratorP16M_AsyncTensorMapP8M_Status) * [`M_advanceTensorMapIterator()`](tensor.md#_CPPv426M_advanceTensorMapIteratorP19M_TensorMapIterator) * [`M_getNameFromMapIterator()`](tensor.md#_CPPv424M_getNameFromMapIteratorP19M_TensorMapIterator) * [`M_getTensorFromMapIterator()`](tensor.md#_CPPv426M_getTensorFromMapIteratorP19M_TensorMapIterator) * [`M_isEndOfTensorMap()`](tensor.md#_CPPv418M_isEndOfTensorMapP16M_AsyncTensorMapP19M_TensorMapIterator) * [`M_freeTensor()`](tensor.md#_CPPv412M_freeTensorP13M_AsyncTensor) * [`M_freeTensorNameArray()`](tensor.md#_CPPv421M_freeTensorNameArrayP17M_TensorNameArray) * [`M_freeTensorSpec()`](tensor.md#_CPPv416M_freeTensorSpecP12M_TensorSpec) * [`M_freeAsyncTensorMap()`](tensor.md#_CPPv420M_freeAsyncTensorMapP16M_AsyncTensorMap) * 
[`M_freeTensorMapIterator()`](tensor.md#_CPPv423M_freeTensorMapIteratorP19M_TensorMapIterator) * [Types](types.md) * [`M_Status`](types.md#_CPPv48M_Status) * [`M_RuntimeConfig`](types.md#_CPPv415M_RuntimeConfig) * [`M_RuntimeContext`](types.md#_CPPv416M_RuntimeContext) * [`M_UInt64Counter`](types.md#_CPPv415M_UInt64Counter) * [`M_DoubleCounter`](types.md#_CPPv415M_DoubleCounter) * [`M_UInt64Histogram`](types.md#_CPPv417M_UInt64Histogram) * [`M_DoubleHistogram`](types.md#_CPPv417M_DoubleHistogram) * [`M_Int64Gauge`](types.md#_CPPv412M_Int64Gauge) * [`M_DoubleGauge`](types.md#_CPPv413M_DoubleGauge) * [`M_CustomMetricReader`](types.md#_CPPv420M_CustomMetricReader) * [`M_CompileConfig`](types.md#_CPPv415M_CompileConfig) * [`M_DeviceConfig`](types.md#_CPPv414M_DeviceConfig) * [`M_AsyncCompiledModel`](types.md#_CPPv420M_AsyncCompiledModel) * [`M_AsyncModel`](types.md#_CPPv412M_AsyncModel) * [`M_AsyncTensor`](types.md#_CPPv413M_AsyncTensor) * [`M_TensorNameArray`](types.md#_CPPv417M_TensorNameArray) * [`M_TensorSpec`](types.md#_CPPv412M_TensorSpec) * [`M_AsyncTensorMap`](types.md#_CPPv416M_AsyncTensorMap) * [`M_TensorMapIterator`](types.md#_CPPv419M_TensorMapIterator) * [`M_AsyncValue`](types.md#_CPPv412M_AsyncValue) * [`M_Config`](types.md#_CPPv48M_Config) * [`M_AsyncDict`](types.md#_CPPv411M_AsyncDict) * [`M_AsyncList`](types.md#_CPPv411M_AsyncList) * [`M_AsyncTuple`](types.md#_CPPv412M_AsyncTuple) * [`M_AsyncNone`](types.md#_CPPv411M_AsyncNone) * [`M_MaxContext`](types.md#_CPPv412M_MaxContext) * [`M_ModelSource`](types.md#_CPPv413M_ModelSource) * [`M_WeightsRegistry`](types.md#_CPPv417M_WeightsRegistry) * [`M_DevicesList`](types.md#_CPPv413M_DevicesList) * [`M_DeviceRefsList`](types.md#_CPPv416M_DeviceRefsList) * [`M_Dtype`](types.md#_CPPv47M_Dtype) * [`M_UNKNOWN`](types.md#_CPPv4N7M_Dtype9M_UNKNOWNE) * [`mIsInteger`](types.md#_CPPv4N7M_Dtype10mIsIntegerE) * [`mIsFloat`](types.md#_CPPv4N7M_Dtype8mIsFloatE) * [`mIsComplex`](types.md#_CPPv4N7M_Dtype10mIsComplexE) * [`mIsSigned`](types.md#_CPPv4N7M_Dtype9mIsSignedE) * [`kIntWidthShift`](types.md#_CPPv4N7M_Dtype14kIntWidthShiftE) * [`M_INT1`](types.md#_CPPv4N7M_Dtype6M_INT1E) * [`M_UINT1`](types.md#_CPPv4N7M_Dtype7M_UINT1E) * [`M_INT2`](types.md#_CPPv4N7M_Dtype6M_INT2E) * [`M_UINT2`](types.md#_CPPv4N7M_Dtype7M_UINT2E) * [`M_INT4`](types.md#_CPPv4N7M_Dtype6M_INT4E) * [`M_UINT4`](types.md#_CPPv4N7M_Dtype7M_UINT4E) * [`M_INT8`](types.md#_CPPv4N7M_Dtype6M_INT8E) * [`M_UINT8`](types.md#_CPPv4N7M_Dtype7M_UINT8E) * [`M_INT16`](types.md#_CPPv4N7M_Dtype7M_INT16E) * [`M_UINT16`](types.md#_CPPv4N7M_Dtype8M_UINT16E) * [`M_INT32`](types.md#_CPPv4N7M_Dtype7M_INT32E) * [`M_UINT32`](types.md#_CPPv4N7M_Dtype8M_UINT32E) * [`M_INT64`](types.md#_CPPv4N7M_Dtype7M_INT64E) * [`M_UINT64`](types.md#_CPPv4N7M_Dtype8M_UINT64E) * [`M_INT128`](types.md#_CPPv4N7M_Dtype8M_INT128E) * [`M_UINT128`](types.md#_CPPv4N7M_Dtype9M_UINT128E) * [`M_FLOAT8_E3M4`](types.md#_CPPv4N7M_Dtype13M_FLOAT8_E3M4E) * [`M_FLOAT8_E4M3`](types.md#_CPPv4N7M_Dtype13M_FLOAT8_E4M3E) * [`M_FLOAT8_E4M3FN`](types.md#_CPPv4N7M_Dtype15M_FLOAT8_E4M3FNE) * [`M_FLOAT8_E4M3FNUZ`](types.md#_CPPv4N7M_Dtype17M_FLOAT8_E4M3FNUZE) * [`M_FLOAT8_E5M2`](types.md#_CPPv4N7M_Dtype13M_FLOAT8_E5M2E) * [`M_FLOAT8_E5M2FNUZ`](types.md#_CPPv4N7M_Dtype17M_FLOAT8_E5M2FNUZE) * [`M_FLOAT16`](types.md#_CPPv4N7M_Dtype9M_FLOAT16E) * [`M_BFLOAT16`](types.md#_CPPv4N7M_Dtype10M_BFLOAT16E) * [`M_FLOAT32`](types.md#_CPPv4N7M_Dtype9M_FLOAT32E) * [`M_FLOAT64`](types.md#_CPPv4N7M_Dtype9M_FLOAT64E) * [`M_TF32`](types.md#_CPPv4N7M_Dtype6M_TF32E) * 
[`M_BOOL`](types.md#_CPPv4N7M_Dtype6M_BOOLE) * [`M_AllocatorType`](types.md#_CPPv415M_AllocatorType) * [`kSystem`](types.md#_CPPv4N15M_AllocatorType7kSystemE) * [`kCaching`](types.md#_CPPv4N15M_AllocatorType8kCachingE) * [`M_ValueType`](types.md#_CPPv411M_ValueType) * [`M_STRING_VALUE`](types.md#_CPPv4N11M_ValueType14M_STRING_VALUEE) * [`M_DOUBLE_VALUE`](types.md#_CPPv4N11M_ValueType14M_DOUBLE_VALUEE) * [`M_LONG_VALUE`](types.md#_CPPv4N11M_ValueType12M_LONG_VALUEE) * [`M_BOOL_VALUE`](types.md#_CPPv4N11M_ValueType12M_BOOL_VALUEE) * [`M_TENSOR_VALUE`](types.md#_CPPv4N11M_ValueType14M_TENSOR_VALUEE) * [`M_LIST_VALUE`](types.md#_CPPv4N11M_ValueType12M_LIST_VALUEE) * [`M_TUPLE_VALUE`](types.md#_CPPv4N11M_ValueType13M_TUPLE_VALUEE) * [`M_DICT_VALUE`](types.md#_CPPv4N11M_ValueType12M_DICT_VALUEE) * [`M_NONE_VALUE`](types.md#_CPPv4N11M_ValueType12M_NONE_VALUEE) * [`M_UNKNOWN_VALUE`](types.md#_CPPv4N11M_ValueType15M_UNKNOWN_VALUEE) * [`M_MOJO_VALUE`](types.md#_CPPv4N11M_ValueType12M_MOJO_VALUEE) * [`M_PYTHON_MOJO_VALUE`](types.md#_CPPv4N11M_ValueType19M_PYTHON_MOJO_VALUEE) * [`M_FrameworkFormat`](types.md#_CPPv417M_FrameworkFormat) * [`M_MAX_GRAPH_FRAMEWORK_FORMAT`](types.md#_CPPv4N17M_FrameworkFormat28M_MAX_GRAPH_FRAMEWORK_FORMATE) * [`M_TORCHSCRIPT_MODULE_FRAMEWORK_FORMAT`](types.md#_CPPv4N17M_FrameworkFormat37M_TORCHSCRIPT_MODULE_FRAMEWORK_FORMATE) * [`M_TORCHSCRIPT_FUNCTION_FRAMEWORK_FORMAT`](types.md#_CPPv4N17M_FrameworkFormat39M_TORCHSCRIPT_FUNCTION_FRAMEWORK_FORMATE) * [`M_TORCH_MLIR_FRAMEWORK_FORMAT`](types.md#_CPPv4N17M_FrameworkFormat29M_TORCH_MLIR_FRAMEWORK_FORMATE) * [`M_ResultOutputStyle`](types.md#_CPPv419M_ResultOutputStyle) * [`M_COMPACT`](types.md#_CPPv4N19M_ResultOutputStyle9M_COMPACTE) * [`M_FULL`](types.md#_CPPv4N19M_ResultOutputStyle6M_FULLE) * [`M_BINARY`](types.md#_CPPv4N19M_ResultOutputStyle8M_BINARYE) * [`M_BINARY_MAX_CHECKPOINT`](types.md#_CPPv4N19M_ResultOutputStyle23M_BINARY_MAX_CHECKPOINTE) * [`M_NONE`](types.md#_CPPv4N19M_ResultOutputStyle6M_NONEE) * [Value](value.md) * [`M_getValueByNameFrom()`](value.md#_CPPv420M_getValueByNameFromP16M_AsyncTensorMapPKcP8M_Status) * [`M_getValueFromMapIterator()`](value.md#_CPPv425M_getValueFromMapIteratorP19M_TensorMapIterator) * [`M_freeValue()`](value.md#_CPPv411M_freeValueP12M_AsyncValue) * [`M_getStringFromValue()`](value.md#_CPPv420M_getStringFromValueP12M_AsyncValue) * [`M_createStringAsyncValue()`](value.md#_CPPv424M_createStringAsyncValuePKcP16M_RuntimeContext) * [`M_getDoubleFromValue()`](value.md#_CPPv420M_getDoubleFromValueP12M_AsyncValue) * [`M_createDoubleAsyncValue()`](value.md#_CPPv424M_createDoubleAsyncValuedP16M_RuntimeContext) * [`M_getLongFromValue()`](value.md#_CPPv418M_getLongFromValueP12M_AsyncValue) * [`M_createLongAsyncValue()`](value.md#_CPPv422M_createLongAsyncValue7int64_tP16M_RuntimeContext) * [`M_getBoolFromValue()`](value.md#_CPPv418M_getBoolFromValueP12M_AsyncValue) * [`M_createBoolAsyncValue()`](value.md#_CPPv422M_createBoolAsyncValuebP16M_RuntimeContext) * [`M_borrowValueInto()`](value.md#_CPPv417M_borrowValueIntoP16M_AsyncTensorMapPKcPK12M_AsyncValueP8M_Status) * [`M_getValueType()`](value.md#_CPPv414M_getValueTypeP12M_AsyncValue) * [`M_getDictFromValue()`](value.md#_CPPv418M_getDictFromValueP12M_AsyncValue) * [`M_createDictAsyncValue()`](value.md#_CPPv422M_createDictAsyncValueP16M_RuntimeContext) * [`M_insertIntoDict()`](value.md#_CPPv416M_insertIntoDictP11M_AsyncDictP12M_AsyncValueP12M_AsyncValue) * [`M_getListFromValue()`](value.md#_CPPv418M_getListFromValueP12M_AsyncValue) * 
[`M_createListAsyncValue()`](value.md#_CPPv422M_createListAsyncValueP16M_RuntimeContext)
* [`M_appendToList()`](value.md#_CPPv414M_appendToListP11M_AsyncListP12M_AsyncValue)
* [`M_getTupleFromValue()`](value.md#_CPPv419M_getTupleFromValueP12M_AsyncValue)
* [`M_borrowIntoTuple()`](value.md#_CPPv417M_borrowIntoTupleP12M_AsyncTupleP12M_AsyncValue)
* [`M_createTupleAsyncValue()`](value.md#_CPPv423M_createTupleAsyncValueP16M_RuntimeContext)
* [`M_getDictSize()`](value.md#_CPPv413M_getDictSizeP11M_AsyncDict)
* [`M_getListSize()`](value.md#_CPPv413M_getListSizeP11M_AsyncList)
* [`M_getTupleSize()`](value.md#_CPPv414M_getTupleSizeP12M_AsyncTuple)
* [`M_getDictKey()`](value.md#_CPPv412M_getDictKeyP11M_AsyncDict6size_t)
* [`M_getDictValue()`](value.md#_CPPv414M_getDictValueP11M_AsyncDict6size_t)
* [`M_getListValue()`](value.md#_CPPv414M_getListValueP11M_AsyncList6size_t)
* [`M_getTupleValue()`](value.md#_CPPv415M_getTupleValueP12M_AsyncTuple6size_t)
* [`M_createNoneAsyncValue()`](value.md#_CPPv422M_createNoneAsyncValueP16M_RuntimeContext)
* [`M_freeDict()`](value.md#_CPPv410M_freeDictP11M_AsyncDict)
* [`M_freeList()`](value.md#_CPPv410M_freeListP11M_AsyncList)
* [`M_freeTuple()`](value.md#_CPPv411M_freeTupleP12M_AsyncTuple)
* [`M_freeNone()`](value.md#_CPPv410M_freeNoneP11M_AsyncNone)

## Async API usage

Our C API allows for compiling and executing models asynchronously. In general, effective use of asynchronous APIs may be difficult, but rewarding for performance. To help with this, we explain some important concepts and mental models to keep in mind when using the API. Our APIs are async-safe unless stated otherwise; synchronous variants are marked with `Sync` in the function identifier name. For example, we have `M_executeModel` and [`M_executeModelSync()`](model.md#_CPPv418M_executeModelSyncPK16M_RuntimeContextP12M_AsyncModelP16M_AsyncTensorMapP8M_Status).

### Types

Our API models its asynchronous value-holding types with a “value or error” concept. Conceptually, this means that the type is in one of three states:

* `Constructed` - the value is not yet there, but there is no error
* `Available` - the value is there and ready for use
* `Error` - the value is not there and there is an error

### Synchronization points

When using async APIs, it is a good idea to be mindful of the synchronization-point APIs listed below. They are useful for discerning between the `Constructed` and `Available` states mentioned above. After calling a synchronization point, the input will never be in a `Constructed` state: it will always resolve to either `Available` or `Error`.

* [`M_waitForCompilation()`](model.md#_CPPv420M_waitForCompilationP20M_AsyncCompiledModelP8M_Status)
* [`M_waitForModel()`](model.md#_CPPv414M_waitForModelP12M_AsyncModelP8M_Status)
* `M_waitForTensors`

### Errors

Errors surface immediately when using our synchronous APIs. With the async APIs, errors do not surface until the next synchronization point. You can query the error message by calling [`M_getError()`](common.md#_CPPv410M_getErrorPK8M_Status).
---

## cache_params

## `KVCacheParams` {#max.nn.kv_cache.cache_params.KVCacheParams}

> *class* max.nn.kv\_cache.cache\_params.KVCacheParams(dtype: max.\_core.dtype.DType, n\_kv\_heads: int, head\_dim: int, enable\_prefix\_caching: bool = False, enable\_kvcache\_swapping\_to\_host: bool = False, host\_kvcache\_swap\_space\_gb: Optional\[float] = None, cache\_strategy: max.nn.kv_cache.cache_params.KVCacheStrategy = KVCacheStrategy.CONTINUOUS, page\_size: Optional\[int] = None, n\_devices: int = 1)

**Parameters:**

* **dtype** ([`DType`](../../dtype.md#max.dtype.DType) )
* **n\_kv\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **head\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **enable\_prefix\_caching** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )
* **enable\_kvcache\_swapping\_to\_host** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )
* **host\_kvcache\_swap\_space\_gb** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` )
* **cache\_strategy** ([`KVCacheStrategy`](#max.nn.kv_cache.cache_params.KVCacheStrategy) )
* **page\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )
* **n\_devices** ([`int`](https://docs.python.org/3/library/functions.html#int) )

### `cache_strategy` {#max.nn.kv_cache.cache_params.KVCacheParams.cache_strategy}

> cache\_strategy\*: [KVCacheStrategy](#max.nn.kv_cache.cache_params.KVCacheStrategy)\* *= 'continuous'*

### `dtype` {#max.nn.kv_cache.cache_params.KVCacheParams.dtype}

> dtype\*: [DType](../../dtype.md#max.dtype.DType)\*

### `dtype_shorthand` {#max.nn.kv_cache.cache_params.KVCacheParams.dtype_shorthand}

> *property* dtype\_shorthand\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\*

The textual representation in shorthand of the dtype.
### `enable_kvcache_swapping_to_host` {#max.nn.kv_cache.cache_params.KVCacheParams.enable_kvcache_swapping_to_host}

> enable\_kvcache\_swapping\_to\_host\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False*

### `enable_prefix_caching` {#max.nn.kv_cache.cache_params.KVCacheParams.enable_prefix_caching}

> enable\_prefix\_caching\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False*

### `head_dim` {#max.nn.kv_cache.cache_params.KVCacheParams.head_dim}

> head\_dim\*: [int](https://docs.python.org/3/library/functions.html#int)\*

### `host_kvcache_swap_space_gb` {#max.nn.kv_cache.cache_params.KVCacheParams.host_kvcache_swap_space_gb}

> host\_kvcache\_swap\_space\_gb\*: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

### `n_devices` {#max.nn.kv_cache.cache_params.KVCacheParams.n_devices}

> n\_devices\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 1*

### `n_kv_heads` {#max.nn.kv_cache.cache_params.KVCacheParams.n_kv_heads}

> n\_kv\_heads\*: [int](https://docs.python.org/3/library/functions.html#int)\*

### `page_size` {#max.nn.kv_cache.cache_params.KVCacheParams.page_size}

> page\_size\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

### `static_cache_shape` {#max.nn.kv_cache.cache_params.KVCacheParams.static_cache_shape}

> *property* static\_cache\_shape\*: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [str](https://docs.python.org/3/library/stdtypes.html#str), [str](https://docs.python.org/3/library/stdtypes.html#str), [str](https://docs.python.org/3/library/stdtypes.html#str), [str](https://docs.python.org/3/library/stdtypes.html#str)]\*

## `KVCacheStrategy` {#max.nn.kv\_cache.cache\_params.KVCacheStrategy}

> *class* max.nn.kv\_cache.cache\_params.KVCacheStrategy(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)

### `CONTINUOUS` {#max.nn.kv_cache.cache_params.KVCacheStrategy.CONTINUOUS}

> CONTINUOUS *= 'continuous'*

### `MODEL_DEFAULT` {#max.nn.kv_cache.cache_params.KVCacheStrategy.MODEL_DEFAULT}

> MODEL\_DEFAULT *= 'model\_default'*

### `PAGED` {#max.nn.kv_cache.cache_params.KVCacheStrategy.PAGED}

> PAGED *= 'paged'*

### `kernel_substring()` {#max.nn.kv_cache.cache_params.KVCacheStrategy.kernel_substring}

> kernel\_substring()

Returns the common substring that we include in the kernel name for this caching strategy.

**Return type:**

[str](https://docs.python.org/3/library/stdtypes.html#str)

### `uses_opaque()` {#max.nn.kv_cache.cache_params.KVCacheStrategy.uses_opaque}

> uses\_opaque()

**Return type:**

[bool](https://docs.python.org/3/library/functions.html#bool)

---

## CacheEviction

`@register_passable(trivial)`

`struct CacheEviction`

Represents cache eviction policies for GPU memory operations. This struct defines different cache eviction priorities that control how data is evicted from cache when space is needed. The policies affect cache utilization and performance by controlling which data gets evicted first.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `EVICT_FIRST`

`alias EVICT_FIRST = CacheEviction(1)`

Highest eviction priority - data will be evicted first. Data cached with this priority is marked as the first candidate for eviction when cache space is needed.
This is optimal for: * Streaming data that will not be reused * Single-pass algorithms * Data with low temporal locality ### `EVICT_LAST` `alias EVICT_LAST = CacheEviction(2)` Lowest eviction priority - data will be evicted last. Data cached with this priority remains in cache until all higher priority data is evicted. Best used for: * Frequently accessed data * Data needed across multiple kernel launches * Critical data structures that benefit from cache persistence ### `EVICT_NORMAL` `alias EVICT_NORMAL = CacheEviction(0)` Default cache eviction priority. Data cached with normal priority follows standard cache replacement policies. This is the default behavior and suitable for most general-purpose data access patterns where no special caching requirements exist. ### `EVICT_UNCHANGED` `alias EVICT_UNCHANGED = CacheEviction(3)` Preserves existing cache eviction priority. When this policy is used: * Existing cache entries maintain their current eviction priority * No changes are made to the cache replacement order * Useful for operations that should not affect caching behavior ### `NO_ALLOCATE` `alias NO_ALLOCATE = CacheEviction(4)` Prevents cache allocation for accessed data. Data is not cached when using this policy. Optimal for: * Large sequential reads/writes * Data that will only be accessed once * Preserving cache space for more critical data * Streaming operations with no data reuse ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two CacheEviction instances are equal. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two CacheEviction instances are not equal. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two CacheEviction instances are identical. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two CacheEviction instances are not identical. **Args:** * ​other (`Self`): The CacheEviction to compare against. **Returns:** True if the eviction policies are not identical, False otherwise. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the string mnemonic for this cache eviction policy. Converts the cache eviction policy into its corresponding string representation used in GPU instructions and debugging. **Returns:** A string literal containing the mnemonic for this eviction policy. --- ## CacheOperation `@register_passable(trivial)` `struct CacheOperation` Represents different GPU cache operation policies. This struct defines various caching behaviors for GPU memory operations, controlling how data is cached and evicted at different cache levels. The policies affect performance and memory coherency. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALWAYS` `alias ALWAYS = CacheOperation(0)` Cache at all levels. This will be accessed again. Best for data that will be frequently reused across multiple threads. Provides fastest subsequent access but uses the most cache space. ### `GLOBAL` `alias GLOBAL = CacheOperation(1)` Cache at global level. Caches data only in the L2 cache, bypassing L1. Good for data shared between different thread blocks. 
### `LAST_USE`

`alias LAST_USE = CacheOperation(3)`

Indicates the cache line will not be used again. Hints to the cache that this data can be evicted after this access. Helps optimize cache utilization.

### `STREAMING`

`alias STREAMING = CacheOperation(2)`

Streaming data that is likely to be accessed only once. Optimizes for streaming access patterns where data is read a single time. May bypass certain cache levels for better throughput.

### `VOLATILE`

`alias VOLATILE = CacheOperation(4)`

Don't cache, and fetch again. Forces reads/writes to bypass cache and go directly to memory. Useful for memory-mapped I/O or when cache coherency is required.

### `WRITE_BACK`

`alias WRITE_BACK = CacheOperation(5)`

Write back at all coherent levels. Updates all cache levels and eventually writes to memory. Most efficient for multiple writes to the same location.

### `WRITE_THROUGH`

`alias WRITE_THROUGH = CacheOperation(6)`

Write through to system memory. Immediately writes updates to memory while updating cache. Provides stronger consistency but lower performance than write-back.

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Tests if two CacheOperation instances are equal.

**Args:**

* other (`Self`): The CacheOperation to compare against.

**Returns:**

True if the operations are equal, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Tests if two CacheOperation instances are not equal.

**Args:**

* other (`Self`): The CacheOperation to compare against.

**Returns:**

True if the operations are not equal, False otherwise.

### `__is__`

`__is__(self, other: Self) -> Bool`

Tests if two CacheOperation instances are identical.

**Args:**

* other (`Self`): The CacheOperation to compare against.

**Returns:**

True if the operations are identical, False otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

Tests if two CacheOperation instances are not identical.

**Args:**

* other (`Self`): The CacheOperation to compare against.

**Returns:**

True if the operations are not identical, False otherwise.

### `mnemonic`

`mnemonic(self) -> StringSlice[StaticConstantOrigin]`

Returns the PTX mnemonic string for this cache operation. Converts the cache operation into its corresponding PTX assembly mnemonic string used in GPU instructions.

**Returns:**

A string literal containing the PTX mnemonic for this operation.

---

## calculate_symmetric_vector

`calculate_symmetric_vector[input_dtype: DType, simd_width: Int, output_bits: Int](data: SIMD[input_dtype, simd_width]) -> Tuple[SIMD[uint8, simd_width], SIMD[input_dtype, 1]]`

Symmetrically quantizes the given SIMD vector `data` with input type `input_dtype` and `simd_width` elements, assuming we want the results to fit in an unsigned integer of size `output_bits`.

**Parameters:**

* input\_dtype (`DType`): The dtype of the input tensor.
* simd\_width (`Int`): The width of the SIMD input.
* output\_bits (`Int`): The number of bits the unsigned integral result should fit in.

**Args:**

* data (`SIMD[input_dtype, simd_width]`): The input SIMD we want to quantize.

**Returns:**

* A vector of the quantized values.
* The associated scale factor.

---

## calculate_tile_n_k

`calculate_tile_n_k[a_type: DType, b_type: DType, c_type: DType, kernel_cols: Int](n: Int, k: Int) -> IndexList[2]`

Helper heuristic function to decide on tile size to partition the matmul given the cache size and desired data layout.

**Parameters:**

* a\_type (`DType`): The type of the A tensor.
* b\_type (`DType`): The type of the B tensor.
* c\_type (`DType`): The type of the C tensor.
* kernel\_cols (`Int`): The number of columns of the micro kernel.

**Returns:**

The calculated tile size to partition the matmul as (TileN, TileK).

`calculate_tile_n_k[a_type: DType, b_type: DType, c_type: DType, kernel_cols: Int](global_tile_shape: GemmShape) -> IndexList[2]`

---

## Calling Mojo from Python

If you have an existing Python project that would benefit from Mojo's high-performance computing, you shouldn't have to rewrite the whole thing in Mojo. Instead, you can write just the performance-critical parts of your code in Mojo and then call it from Python.

:::caution Early preview

Calling Mojo code from Python is in early development. You should expect a lot of changes to the API and ergonomics. Likewise, this documentation is still a work in progress. See below for [known limitations](#known-limitations).

:::

## Import a Mojo module in Python

To illustrate what calling Mojo from Python looks like, we'll start with a simple example, and then dig into the details of how it works and what is possible today. Consider a project with the following structure:

```text
project
├── 🐍 main.py
└── 🔥 mojo_module.mojo
```

The main entrypoint is a Python program called `main.py`, and the Mojo code includes functions to call from Python. For example, let's say we want a Mojo function to take a Python value as an argument:

```mojo title="🔥 mojo_module.mojo"
fn factorial(py_obj: PythonObject) raises -> PythonObject:
    var n = Int(py_obj)
    return math.factorial(n)
```

And we want to call it from Python like this:

```python title="🐍 main.py"
import mojo_module

print(mojo_module.factorial(5))
```

However, before we can call the Mojo function from Python, we must declare it so Python knows it exists. Because Python is trying to load `mojo_module`, it looks for a function called `PyInit_mojo_module()`. (If our file is called `foo.mojo`, the function would be `PyInit_foo()`.) Within `PyInit_mojo_module()`, we must declare all Mojo functions and types that are callable from Python using [`PythonModuleBuilder`](/mojo/stdlib/python/bindings/PythonModuleBuilder). So the complete Mojo code looks like this:

```mojo title="🔥 mojo_module.mojo"
from python import PythonObject
from python.bindings import PythonModuleBuilder
import math
from os import abort

@export
fn PyInit_mojo_module() -> PythonObject:
    try:
        var m = PythonModuleBuilder("mojo_module")
        m.def_function[factorial]("factorial", docstring="Compute n!")
        return m.finalize()
    except e:
        return abort[PythonObject](
            String("error creating Python Mojo module:", e)
        )

fn factorial(py_obj: PythonObject) raises -> PythonObject:
    # Raises an exception if `py_obj` is not convertible to a Mojo `Int`.
    var n = Int(py_obj)
    return math.factorial(n)
```

On the Python side, we currently need some more boilerplate code to make it work (but this will improve soon):

```python title="🐍 main.py"
import max._mojo.mojo_importer
import os
import sys

sys.path.insert(0, "")
os.environ["MOJO_PYTHON_LIBRARY"] = ""

import mojo_module

print(mojo_module.factorial(5))
```

That's it! Try it:

```sh
python main.py
```

```output
120
```

### How it works

Python supports a standard mechanism called [Python extension modules](https://docs.python.org/3/extending/extending.html) that enables compiled languages (like Mojo, C, C++, or Rust) to make themselves callable from Python in an intuitive way. Concretely, a Python extension module is simply a dynamic library that defines a suitable `PyInit_*()` function. Mojo comes with built-in functionality for defining Python extension modules.
The special stuff happens in the `max._mojo.mojo_importer` module we imported. If we have a look at the filesystem after Python imports the Mojo code, we'll notice there's a new `__mojocache__` directory, with a dynamic library (`.so`) file inside:

```text
project
├── main.py
├── mojo_module.mojo
└── __mojocache__
    └── mojo_module.hash-ABC123.so
```

Importing `max._mojo.mojo_importer` installs our Python Mojo [import hook](https://docs.python.org/3/reference/import.html#import-hooks), which behind the scenes looks for a `.mojo` (or `.🔥`) file that matches the imported module name, and if found, compiles it using [`mojo build --emit shared-lib`](/mojo/cli/build#--emit-file_type) to generate a dynamic library. The resulting file is stored in `__mojocache__`, and is rebuilt only when it becomes stale (typically, when the Mojo source file changes).

Now that we've looked at the basics of how Mojo can be used from Python, let's dig into the available features and how you can leverage them to accelerate your Python with Mojo.

## Binding Mojo types

All Mojo types are eligible to be bound for use from Python. To expose a Mojo type to Python, it must implement the [`TypeIdentifiable`][TypeIdentifiable] trait. For a simple Mojo type, this might look like:

```mojo title="🔥 Mojo"
struct Person(TypeIdentifiable, ...):
    var name: String
    var age: Int

    # Unique name under which the type object is stored.
    # Eventually this will be a compiler provided unique type ID.
    alias TYPE_ID = "mojo_module.Person"
```

:::note

Types currently must also implement `Movable`, `Defaultable`, and `Representable` to be bound for use from Python.

:::

This enables the type to be bound using `PythonModuleBuilder.add_type[Person]()`:

```mojo title="🔥 Mojo"
var mb = PythonModuleBuilder("mojo_module")
mb.add_type[Person]("Person")
```

Any Mojo type bound using a `PythonTypeBuilder` has its resulting Python type object globally registered, which enables two features:

* Constructing Python objects that wrap Mojo values for use from Python using `PythonObject(alloc=Person(..))`.
* Downcasting using `python_obj.downcast_value_ptr[Person]()`.

## Constructing Python objects in Mojo

:::note

Python Mojo bindings do not currently support Mojo `__init__()` methods that take arguments. However, a workaround is possible using free functions that construct new objects, shown below.

:::

Mojo functions called from Python don't just need to accept [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) values as arguments; they also need to be able to return new values, and sometimes they even need to return native Mojo values back to Python. This is possible by using the `PythonObject(alloc=)` constructor. An example of this looks like:

```mojo title="🔥 Mojo"
fn create_person() raises -> PythonObject:
    var person = Person("Sarah", 32)
    return PythonObject(alloc=person^)
```

:::caution

`PythonObject(alloc=...)` will raise an exception if the provided Mojo object type had not previously been registered using [`PythonModuleBuilder.add_type()`](/mojo/stdlib/python/bindings/PythonModuleBuilder#add_type).

:::
## `PythonObject` to Mojo values

Within any Mojo code that is handling a [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject), but especially within Mojo functions called from Python, it's common to expect an argument of a particular type and to want to turn it into a native Mojo value. There are two scenarios where a `PythonObject` can be "converted" into a native Mojo value:

* **Converting** a Python object into a newly constructed Mojo value that has the same logical value as the original Python object. This is handled by the [`ConvertibleFromPython`][ConvertibleFromPython] trait.
* **Downcasting** a Python object that holds a native Mojo value to a pointer to that inner value. This is handled by [`PythonObject.downcast_value_ptr()`][downcast_value_ptr].

### `PythonObject` conversions

:::note

**Binding Initializers.** One current limitation is that non-default Mojo type `__init__()` methods cannot be bound for calling from Python. In addition to showing argument conversions, this example also shows how a top-level function can be used to construct and return instances of Mojo types to Python.

:::

Many Mojo types support conversion directly from equivalent Python types, via the [`ConvertibleFromPython`][ConvertibleFromPython] trait:

```mojo title="🔥 Mojo"
fn create_person(
    name_obj: PythonObject, age_obj: PythonObject
) raises -> PythonObject:
    # These conversions will raise an exception if they fail
    var name = String(name_obj)
    var age = Int(age_obj)

    return PythonObject(alloc=Person(name, age))
```

Which could be called from Python using:

```python title="🐍 Python"
person = mojo_module.create_person("John Smith", 42)
```

Passing invalid arguments would result in a type error:

```python title="🐍 Python"
# This call raises an error, because the arguments can't be
# converted to a Mojo `String` and `Int`.
person = mojo_module.create_person([1, 2, 3], {"foo": 4})
```

### `PythonObject` downcasts

Downcasting from a `PythonObject` to the inner Mojo value looks like this:

```mojo title="🔥 Mojo"
fn print_age(person_obj: PythonObject) raises:
    # Raises if `person_obj` does not contain an instance of the Mojo
    # `Person` type.
    var person = person_obj.downcast_value_ptr[Person]()
    # TODO(MSTDL-1581):
    #   var person = Pointer[Person](downcast_value=person_obj)

    print("Person is", person[].age, "years old")
```

Unsafe mutable access via downcasting is also supported. It is up to the user to ensure that this mutable pointer does not alias any other pointers to the same object within Mojo:

```mojo title="🔥 Mojo"
fn birthday(person_obj: PythonObject) raises:
    var person = person_obj.downcast_value_ptr[Person]()
    # TODO:
    #   var person = Pointer[Person](unsafe_unique_downcast=person_obj)

    person[].age += 1
```

Entirely unchecked downcasting, which does no type checking, can be done using:

```mojo title="🔥 Mojo"
fn get_person(person_obj: PythonObject):
    var person = person_obj.unchecked_downcast_value_ptr[Person]()
    # TODO:
    #   var person = Pointer[Person](unchecked_downcast_value=person_obj)
```

Unchecked downcasting can be used to eliminate overhead when optimizing a tight inner loop with Mojo, once you've benchmarked and measured that checked downcasts are a significant bottleneck.

## Writing Python in Mojo

In this approach to bindings, we embrace the flexibility of Python and eschew trying to convert `PythonObject` arguments into the narrowly constrained, strongly-typed space of the Mojo type system, in favor of just writing some code and letting it raise an exception at runtime if we got something wrong.
The flexibility of `PythonObject` enables a unique programming style, wherein Python code can be "ported" to Mojo with relatively few changes:

```python title="🐍 Python"
def foo(x, y, z):
    x[y] = int(z)
    x = y + z
```

Rule of thumb: any Python builtin function should be accessible in Mojo as a static method on the `Python` type (for example, Python's `int()` becomes `Python.int()`):

```mojo title="🔥 Mojo"
def foo(x: PythonObject, y: PythonObject, z: PythonObject):
    x[y] = Python.int(z)
    x = y + z
```

## Keyword arguments

Keyword arguments are not currently supported natively in Python Mojo bindings, but a simple pattern can be used to provide them to users of your library, using a Python wrapper function that passes keyword arguments into Mojo using a dict. A simple example of this pattern looks like:

```python title="🐍 Python"
import mojo_module

def supports_kwargs(pos, *, kw1=None, kw2=None):
    mojo_module.supports_kwargs(pos, {"kw1": kw1, "kw2": kw2})
```

```mojo title="🔥 Mojo"
fn supports_kwargs(pos: PythonObject, kwargs: PythonObject) raises:
    var kw1 = kwargs["kw1"]
    var kw2 = kwargs["kw2"]
```

Because keyword argument validation and default values are handled within the Python wrapper function, callers will get the standard argument errors they expect. And the Mojo code stays simple, as getting a keyword argument is a simple dictionary lookup.

## Variadic functions

When binding functions using [`PythonModuleBuilder.def_function()`](/mojo/stdlib/python/bindings/PythonModuleBuilder#def_function), only fixed-arity functions are supported. To expose Mojo functions that accept a variadic number of arguments to Python, you can use the lower-level [`def_py_function()`](/mojo/stdlib/python/bindings/PythonModuleBuilder#def_py_function) interface, which leaves it to the user to validate the number of arguments provided.

```mojo title="🔥 Mojo"
@export
fn PyInit_mojo_module() -> PythonObject:
    try:
        var b = PythonModuleBuilder("mojo_module")
        b.def_py_function[count_args]("count_args")
        b.def_py_function[sum_args]("sum_args")
        b.def_py_function[lookup]("lookup")
        return b.finalize()
    except e:
        return abort[PythonObject](
            String("error creating Python Mojo module:", e)
        )

fn count_args(
    py_self: PythonObject, args: TypedPythonObject["Tuple"]
) raises -> PythonObject:
    return len(args)

fn sum_args(
    py_self: PythonObject, args: TypedPythonObject["Tuple"]
) raises -> PythonObject:
    var total = args[0]
    for i in range(1, len(args)):
        total += args[i]
    return total

fn lookup(
    py_self: PythonObject, args: TypedPythonObject["Tuple"]
) raises -> PythonObject:
    if len(args) != 2 and len(args) != 3:
        raise Error("lookup() expects 2 or 3 arguments")

    var collection = args[0]
    var key = args[1]

    try:
        return collection[key]
    except e:
        if len(args) == 3:
            return args[2]
        else:
            raise e
```

## Building Mojo extension modules

You can create and distribute your Mojo modules for Python in the following ways:

* As source files, compiled on demand using the Python Mojo importer hook. The advantage of this approach is that it's easy to get started with, and keeps your project structure simple, while ensuring that your imported Mojo code is always up to date after you make an edit.
* As pre-built Python extension module `.so` dynamic libraries, compiled using:

  ```shell
  $ mojo build mojo_module.mojo --emit shared-lib -o mojo_module.so
  ```

  This has the advantage that you can specify any other necessary build options manually (optimization or debug flags, import paths, etc.), providing an "escape hatch" from the Mojo import hook abstraction for advanced users.
## Known limitations

While we have big ambitions for Python-to-Mojo interoperability (our goal is for Mojo to be the best way to extend Python), this feature is still in early and active development, and there are some limitations to be aware of. These will be lifted over time.

* **Functions taking more than 3 arguments.** Currently `PyTypeBuilder.add_function()` and related function bindings only support Mojo functions that take up to 3 `PythonObject` arguments: `fn(PythonObject, PythonObject, PythonObject)`.
* **Binding non-default initializers.** Currently, only Mojo types that are default constructible (`Foo()`) can be bound and constructed using standard object init syntax from within Python. A workaround pattern is described above.
* **Keyword arguments.** Currently, Mojo functions callable from Python only natively support positional arguments. (However, if you really need them, a simple pattern for supporting keyword arguments is described above.)
* **Mojo package dependencies.** Mojo code that depends on packages other than the Mojo stdlib (like those in the ever-growing [Modular Community](https://github.com/modular/modular-community) package channel) is currently only supported when building Mojo extension modules manually, as the Mojo import hook does not currently support a way to specify import paths for Mojo package dependencies.
* **Static methods.** Binding to type `@staticmethod` methods is not currently supported. Consider using a free function (top-level function) instead for the time being.
* **Properties.** Computed property getters and setters are not currently supported.
* **Expected type conversions.** A handful of Mojo standard library types can be constructed directly from equivalent Python builtin object types, by implementing the [`ConvertibleFromPython`][ConvertibleFromPython] trait. However, many Mojo standard library types do not yet implement this trait, so you may need to write manual conversion logic.

[ConvertibleFromPython]: /mojo/stdlib/python/python_object/ConvertibleFromPython
[TypeIdentifiable]: /mojo/stdlib/builtin/identifiable/TypeIdentifiable
[downcast_value_ptr]: /mojo/stdlib/python/python_object/PythonObject#downcast_value_ptr

---

## Calling Python from Mojo

The Python ecosystem is full of useful libraries, so you shouldn't have to rewrite them in Mojo. Instead, you can simply import Python packages and call Python APIs from Mojo. The Python code runs in a standard Python interpreter (CPython), so your existing Python code doesn't need to change.

## Import a Python module in Mojo

To import a Python module in Mojo, just call [`Python.import_module()`](/mojo/stdlib/python/python/Python#import_module) with the module name. The following shows an example of importing the standard Python [NumPy](https://numpy.org/) package:

```mojo title="🔥 Mojo"
from python import Python

def main():
    # This is equivalent to Python's `import numpy as np`
    np = Python.import_module("numpy")

    # Now use numpy as if writing in Python
    array = np.array(Python.list(1, 2, 3))
    print(array)
```

Running this program produces the following output:

```output
[1 2 3]
```

Assuming that you have the NumPy package installed in your [environment](#create-a-python-environment), this imports NumPy and you can use any of its features. A few things to note:

* The `import_module()` method returns a reference to the module in the form of a [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) wrapper.
You must store the reference in a variable and then use it as shown in the example above to access functions, classes, and other objects defined by the module. See [Mojo wrapper objects](/mojo/manual/python/types#mojo-wrapper-objects) for more information about the `PythonObject` type.
* Currently, you cannot import individual members (such as a single Python class or function). You must import the whole Python module and then access members through the module name.
* Mojo doesn't yet support top-level code, so the `import_module()` call must be inside a function. This means you may need to import a module multiple times or pass around a reference to the module. This works the same way as Python: importing the module multiple times won't run the initialization logic more than once, so you don't pay any performance penalty.
* `import_module()` may raise an exception (for example, if the module isn't installed). If you're using it inside an `fn` function, you need to either handle errors (using a `try/except` clause) or add the `raises` keyword to the function signature. You'll also see this when calling Python functions that may raise exceptions. (Raising exceptions is much more common in Python code than in the Mojo standard library, which [limits their use for performance reasons](/mojo/roadmap#the-standard-library-has-limited-exceptions-use).)

:::caution

[`mojo build`](/mojo/cli/build) doesn't include the Python packages used by your Mojo project. Instead, Mojo loads the Python interpreter and Python packages at runtime, so they must be provided in the environment where you run the Mojo program (such as inside the Magic environment where you built the executable). For more information, see the section above on how to [create a Python environment](#create-a-python-environment).

:::

### Import a local Python module

If you have some local Python code you want to use in Mojo, just add the directory to the Python path and then import the module. For example, suppose you have a Python file named `mypython.py`:

```python title="🐍 mypython.py"
import numpy as np

def gen_random_values(size, base):
    # generate a size x size array of random numbers between base and base+1
    random_array = np.random.rand(size, size)
    return random_array + base
```

Here's how you can import it and use it in a Mojo file:

```mojo title="🔥 main.mojo"
from python import Python

def main():
    Python.add_to_path("path/to/module")
    mypython = Python.import_module("mypython")

    values = mypython.gen_random_values(2, 3)
    print(values)
```

Both absolute and relative paths work with [`add_to_path()`](/mojo/stdlib/python/python/Python#add_to_path). For example, you can import from the local directory like this:

```mojo title="🔥 Mojo"
Python.add_to_path(".")
```

---

## can_enable_p2p

`can_enable_p2p(ctxs: List[DeviceContext]) -> Bool`

If peer-to-peer access is supported, enables it between all GPU pairs.

**Args:**

* ctxs (`List[DeviceContext]`): List of device contexts representing different GPUs.

**Returns:**

True if P2P access is possible between all GPU pairs, False otherwise.

---

## CausalMask

`@register_passable(trivial)`

`struct CausalMask`

Multi-head attention (MHA) causal mask that ensures a token is only affected by previous tokens.
## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAMask`, `Movable`, `UnknownDestructibility`

## Aliases

### `apply_log2e_after_mask`

`alias apply_log2e_after_mask = False`

### `mask_out_of_bound`

`alias mask_out_of_bound = is_nvidia_gpu()`

### `mask_safe_out_of_bounds`

`alias mask_safe_out_of_bounds = True`

## Methods

### `mask`

`mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]`

### `status`

`status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus`

---

## cbrt

`cbrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the cube root (`cbrt`) of the input.

**Constraints:**

The input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): The input argument.

**Returns:**

The cube root of the input.

---

## ceil

`ceil[T: Ceilable, //](value: T) -> T`

Get the ceiling value of the given object.

**Parameters:**

* T (`Ceilable`): The type conforming to `Ceilable`.

**Args:**

* value (`T`): The object to get the ceiling value of.

**Returns:**

The ceiling value of the object.

---

## Ceilable

The `Ceilable` trait describes a type that defines a ceiling operation. Types that conform to `Ceilable` will work with the builtin `ceil` function. The ceiling operation always returns the same type as the input.

For example:

```mojo
from math import Ceilable, ceil

@value
struct Complex(Ceilable):
    var re: Float64
    var im: Float64

    fn __ceil__(self) -> Self:
        return Self(ceil(self.re), ceil(self.im))
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__ceil__`

`__ceil__(self: _Self) -> _Self`

Return the ceiling value of this object.

**Returns:**

The ceiling value.

---

## ceildiv

`ceildiv[T: CeilDivable, //](numerator: T, denominator: T) -> T`

Return the rounded-up result of dividing numerator by denominator.

**Parameters:**

* T (`CeilDivable`): A type that supports floor division.

**Args:**

* numerator (`T`): The numerator.
* denominator (`T`): The denominator.

**Returns:**

The ceiling of dividing numerator by denominator.

`ceildiv[T: CeilDivableRaising, //](numerator: T, denominator: T) -> T`

Return the rounded-up result of dividing numerator by denominator, potentially raising.

**Parameters:**

* T (`CeilDivableRaising`): A type that supports floor division.

**Args:**

* numerator (`T`): The numerator.
* denominator (`T`): The denominator.

**Returns:**

The ceiling of dividing numerator by denominator.

`ceildiv(numerator: IntLiteral[value], denominator: IntLiteral[value]) -> IntLiteral[(0 - (value // (0 - value)))]`

Return the rounded-up result of dividing numerator by denominator.

**Args:**

* numerator (`IntLiteral[value]`): The numerator.
* denominator (`IntLiteral[value]`): The denominator.

**Returns:**

The ceiling of dividing numerator by denominator.

---

## CeilDivable

The `CeilDivable` trait describes a type that defines a ceil division operation. Types that conform to `CeilDivable` will work with the `math.ceildiv` function.
For example:

```mojo
from math import CeilDivable

@value
struct Foo(CeilDivable):
    var x: Float64

    fn __ceildiv__(self, denominator: Self) -> Self:
        # Ceiling division expressed via negated floor division.
        return Self(-(-self.x // denominator.x))
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__ceildiv__`

`__ceildiv__(self: _Self, denominator: _Self) -> _Self`

Return the rounded-up result of dividing self by denominator.

**Args:**

* denominator (`_Self`): The denominator.

**Returns:**

The ceiling of dividing `self` by `denominator`.

---

## CeilDivableRaising

The `CeilDivableRaising` trait describes a type that defines a floor division and negation operation that can raise. Types that conform to `CeilDivableRaising` will work with the `//` operator as well as the `math.ceildiv` function.

For example:

```mojo
from math import CeilDivableRaising

@value
struct Foo(CeilDivableRaising):
    var x: Float64

    fn __ceildiv__(self, denominator: Self) raises -> Self:
        # Ceiling division expressed via negated floor division.
        return Self(-(-self.x // denominator.x))
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__ceildiv__`

`__ceildiv__(self: _Self, denominator: _Self) -> _Self`

Return the rounded-up result of dividing self by denominator.

**Args:**

* denominator (`_Self`): The denominator.

**Returns:**

The ceiling of dividing `self` by `denominator`.

---

## check_arguments_arity

`check_arguments_arity(arity: Int, args: TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")])`

Validate that the provided arguments match the expected function arity. This function checks if the number of arguments in the provided tuple matches the expected arity for a function call. If the counts don't match, it raises a descriptive error message similar to Python's built-in TypeError messages.

**Args:**

* arity (`Int`): The expected number of arguments for the function.
* args (`TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]`): A tuple containing the actual arguments passed to the function.

**Raises:**

Error: If the argument count doesn't match the expected arity. The error message follows Python's convention for TypeError messages, indicating whether too few or too many arguments were provided.

`check_arguments_arity(arity: Int, args: TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")], func_name: StringSlice[origin])`

Validate that the provided arguments match the expected function arity. This function checks if the number of arguments in the provided tuple matches the expected arity for a function call. If the counts don't match, it raises a descriptive error message similar to Python's built-in TypeError messages.

**Args:**

* arity (`Int`): The expected number of arguments for the function.
* args (`TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]`): A tuple containing the actual arguments passed to the function.
* func\_name (`StringSlice[origin]`): The name of the function being called, used in error messages to provide better debugging information.

**Raises:**

Error: If the argument count doesn't match the expected arity. The error message follows Python's convention for TypeError messages, indicating whether too few or too many arguments were provided, along with the specific function name.

---

## check_cudnn_error

`check_cudnn_error(stat: cudnnStatus_t)`

---

## chr

`chr(c: Int) -> String`

Returns a String based on the given Unicode code point. This is the inverse of the `ord()` function. This function is in the prelude, so you don't need to import it.
Example:

```mojo
print(chr(97), chr(8364))  # "a €"
```

**Args:**

* c (`Int`): An integer that represents a code point.

**Returns:**

A string containing a single character based on the given code point.

---

## ChunkedCausalMask

`ChunkedCausalMask[local_window_size: Int]() -> OrMask[CausalMask(), ChunkedMask()]`

Mask implementing Chunked Causal attention for Llama4 models. This groups the mask into chunks of size `local_window_size` and performs causal attention within each local chunk.

Considering the following case:

* Q\_len = 7
* K\_len = 10
* start\_pos = 3
* local\_window\_size = 4

The mask will be applied as follows:

```
K > 0 1 2 3 4 5 6 7 8 9
Q v x--------------------x
0 | 1 1 1 1 0 0 0 0 0 0
1 | 0 0 0 0 1 0 0 0 0 0
2 | 0 0 0 0 1 1 0 0 0 0
3 | 0 0 0 0 1 1 1 0 0 0
4 | 0 0 0 0 1 1 1 1 0 0
5 | 0 0 0 0 0 0 0 0 1 0
6 | 0 0 0 0 0 0 0 0 1 1
```

---

## ChunkedMask

`@register_passable(trivial)`

`struct ChunkedMask[local_window_size: Int]`

Mask implementing Chunked attention. This groups the mask into chunks of size `local_window_size`.

Considering the following case:

* Q\_len = 7
* K\_len = 10
* local\_window\_size = 4

The mask will be applied as follows:

```
K > 0 1 2 3 4 5 6 7 8 9
Q v x--------------------x
0 | 1 1 1 1 0 0 0 0 0 0
1 | 0 0 0 0 1 1 1 1 0 0
2 | 0 0 0 0 1 1 1 1 0 0
3 | 0 0 0 0 1 1 1 1 0 0
4 | 0 0 0 0 1 1 1 1 0 0
5 | 0 0 0 0 0 0 0 0 1 1
6 | 0 0 0 0 0 0 0 0 1 1
```

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAMask`, `Movable`, `UnknownDestructibility`

## Aliases

### `apply_log2e_after_mask`

`alias apply_log2e_after_mask = False`

### `mask_out_of_bound`

`alias mask_out_of_bound = True`

### `mask_safe_out_of_bounds`

`alias mask_safe_out_of_bounds = True`

## Methods

### `mask`

`mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]`

### `status`

`status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus`

---

## clamp

`clamp(val: Int, lower_bound: Int, upper_bound: Int) -> Int`

Clamps the integer value to be within a certain range.

**Args:**

* val (`Int`): The value to clamp.
* lower\_bound (`Int`): Minimum of the range to clamp to.
* upper\_bound (`Int`): Maximum of the range to clamp to.

**Returns:**

An integer clamped to be within lower\_bound and upper\_bound.

`clamp(val: UInt, lower_bound: UInt, upper_bound: UInt) -> UInt`

Clamps the integer value to be within a certain range.

**Args:**

* val (`UInt`): The value to clamp.
* lower\_bound (`UInt`): Minimum of the range to clamp to.
* upper\_bound (`UInt`): Maximum of the range to clamp to.

**Returns:**

An integer clamped to be within lower\_bound and upper\_bound.

`clamp[dtype: DType, width: Int, //](val: SIMD[dtype, width], lower_bound: SIMD[dtype, width], upper_bound: SIMD[dtype, width]) -> SIMD[dtype, width]`

Clamps the values in a SIMD vector to be in a certain range. Clamp cuts values in the input SIMD vector off at the upper bound and lower bound values. For example, SIMD vector `[0, 1, 2, 3]` clamped to a lower bound of 1 and an upper bound of 2 would return `[1, 1, 2, 2]`.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* val (`SIMD[dtype, width]`): The value to clamp.
* lower\_bound (`SIMD[dtype, width]`): Minimum of the range to clamp to.
* upper\_bound (`SIMD[dtype, width]`): Maximum of the range to clamp to.

**Returns:**

A SIMD vector containing `val` clamped to be within lower\_bound and upper\_bound.

---

## clobber_memory

`clobber_memory()`

Forces all pending memory writes to be flushed to memory. This ensures that the compiler does not optimize away memory writes if it deems them to be not necessary. In effect, this operation acts as a barrier to memory reads and writes.

---

## ClockType

`@register_passable(trivial)`

`struct ClockType`

## Fields

* code (`SIMD[int32, 1]`):

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `GRAPHICS`

`alias GRAPHICS = ClockType(__init__[__mlir_type.!pop.int_literal](0))`

Graphics clock domain

### `MEM`

`alias MEM = ClockType(__init__[__mlir_type.!pop.int_literal](2))`

Memory clock domain

### `SM`

`alias SM = ClockType(__init__[__mlir_type.!pop.int_literal](1))`

SM clock domain

### `VIDEO`

`alias VIDEO = ClockType(__init__[__mlir_type.!pop.int_literal](3))`

Video clock domain

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

### `__ne__`

`__ne__(self, other: Self) -> Bool`

---

## cluster

This module provides low-level NVIDIA GPU cluster synchronization primitives for SM90+ architectures. The module implements thread block cluster operations that enable efficient communication and synchronization between thread blocks (CTAs) within a cluster on NVIDIA Hopper architecture and newer GPUs. All functions are constrained to NVIDIA SM90+ GPUs and will raise an error if used on unsupported hardware.

Note: These are low-level primitives that correspond directly to PTX/NVVM instructions and should be used with careful consideration of the underlying hardware synchronization mechanisms.

## Functions

* [​`block_rank_in_cluster`](/mojo/stdlib/gpu/cluster/block_rank_in_cluster): Returns the unique identifier (rank) for the current thread block within its cluster.
* [​`cluster_arrive`](/mojo/stdlib/gpu/cluster/cluster_arrive): Signals arrival at a cluster synchronization point with memory ordering guarantees.
* [​`cluster_arrive_relaxed`](/mojo/stdlib/gpu/cluster/cluster_arrive_relaxed): Signals arrival at a cluster synchronization point with relaxed memory ordering.
* [​`cluster_sync`](/mojo/stdlib/gpu/cluster/cluster_sync): Performs a full cluster synchronization with memory ordering guarantees.
* [​`cluster_sync_relaxed`](/mojo/stdlib/gpu/cluster/cluster_sync_relaxed): Performs a full cluster synchronization with relaxed memory ordering.
* [​`cluster_wait`](/mojo/stdlib/gpu/cluster/cluster_wait): Waits for all thread blocks in the cluster to arrive at the synchronization point.
* [​`elect_one_sync`](/mojo/stdlib/gpu/cluster/elect_one_sync): Elects a single thread within a warp to perform an operation.

---

## cluster_arrive

`cluster_arrive()`

Signals arrival at a cluster synchronization point with memory ordering guarantees. This function ensures all prior memory operations from this thread block are visible to other thread blocks in the cluster before proceeding. Only supported on NVIDIA SM90+ GPUs.

---

## cluster_arrive_relaxed

`cluster_arrive_relaxed()`

Signals arrival at a cluster synchronization point with relaxed memory ordering. This is a relaxed version of cluster\_arrive() that does not enforce memory ordering guarantees. It should be used when memory ordering is not required between thread blocks in the cluster. Only supported on NVIDIA SM90+ GPUs.
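To make the arrive/wait split concrete, here's a minimal sketch (not from the API reference) of how the two phases are typically paired inside a kernel that runs as part of a thread block cluster on an SM90+ GPU:

```mojo
from gpu.cluster import cluster_arrive, cluster_wait

fn cluster_two_phase_kernel():
    # Phase 1: produce results that other thread blocks in the cluster
    # will read.
    # ...

    # Signal that this block has arrived and its prior writes are visible.
    cluster_arrive()

    # Optionally overlap independent work here that doesn't touch the
    # shared results, hiding some of the synchronization latency.

    # Block until every thread block in the cluster has arrived.
    cluster_wait()

    # Phase 2: it is now safe to read results produced by other blocks.
    # ...
```

Splitting `cluster_arrive()` from `cluster_wait()` (rather than calling `cluster_sync()`) is what enables overlapping independent work with the synchronization.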
--- ## cluster_size `cluster_size[cluster_shape: StaticTuple[SIMD[int32, 1], 3]]() -> SIMD[int32, 1]` --- ## cluster_sync `cluster_sync()` Performs a full cluster synchronization with memory ordering guarantees. This is a convenience function that combines cluster\_arrive() and cluster\_wait() to provide a full barrier synchronization across all thread blocks in the cluster. Ensures memory ordering between thread blocks. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_sync_relaxed `cluster_sync_relaxed()` Performs a full cluster synchronization with relaxed memory ordering. This is a convenience function that combines cluster\_arrive\_relaxed() and cluster\_wait() to provide a barrier synchronization across all thread blocks in the cluster without memory ordering guarantees. Only supported on NVIDIA SM90+ GPUs. --- ## cluster_wait `cluster_wait()` Waits for all thread blocks in the cluster to arrive at the synchronization point. This function blocks until all thread blocks in the cluster have called cluster\_arrive() or cluster\_arrive\_relaxed(). Only supported on NVIDIA SM90+ GPUs. --- ## coalesce `coalesce(layout: Layout, keep_rank: Bool = False) -> Layout` Simplifies a layout by combining dimensions with contiguous strides. This function reduces the rank of a layout by merging dimensions that have contiguous memory layouts, resulting in a simpler but equivalent layout. Example: ```mojo from layout import Layout, IntTuple from layout.layout import coalesce # A layout with shape (2, (1, 4)) and stride (1, (4, 2)) can be coalesced var layout = Layout(IntTuple(2, IntTuple(1, 4)), IntTuple(1, IntTuple(4, 2))) var coalesced = coalesce(layout) # Result: Layout with shape (8) and stride (1) ``` . **Args:** * ​layout (`Layout`): The layout to coalesce. * ​keep\_rank (`Bool`): If True, maintains the original rank of the layout. Default is False. **Returns:** A simplified layout with reduced rank where possible. --- ## coalesce `coalesce[l: Layout, keep_rank: Bool = False](layout: RuntimeLayout[l, element_type=element_type, linear_idx_type=linear_idx_type]) -> RuntimeLayout[coalesce(l, keep_rank), element_type=element_type, linear_idx_type=linear_idx_type]` Coalesce adjacent dimensions in a runtime layout when possible. This optimizes the layout by merging adjacent dimensions when their relationship allows it, potentially reducing the number of dimensions. **Parameters:** * ​l (`Layout`): The static layout type to coalesce. * ​keep\_rank (`Bool`): Whether to maintain the original rank (currently unsupported). **Args:** * ​layout (`RuntimeLayout[l, element_type=element_type, linear_idx_type=linear_idx_type]`): The input `RuntimeLayout` to coalesce. **Returns:** A new `RuntimeLayout` with coalesced dimensions. --- ## codepoint Unicode codepoint handling. This module provides the `Codepoint` type for representing single Unicode scalar values. A codepoint represents a single Unicode character, restricted to valid Unicode scalar values in the ranges 0 to 0xD7FF and 0xE000 to 0x10FFFF inclusive. The `Codepoint` type provides functionality for: * Converting between codepoints and UTF-8 encoded bytes. * Testing character properties like ASCII, digits, whitespace etc. * Converting between codepoints and strings. * Safe construction from integers with validation. 
Example:

```mojo
from collections.string import Codepoint
from testing import assert_true

# Create a codepoint from a character
var c = Codepoint.ord('A')

# Check properties
assert_true(c.is_ascii())
assert_true(c.is_ascii_upper())

# Convert to string
var s = String(c)  # "A"
```

## Structs

* [​`Codepoint`](/mojo/stdlib/collections/string/codepoint/Codepoint): A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values.

---

## Codepoint

`struct Codepoint`

A Unicode codepoint, typically a single user-recognizable character; restricted to valid Unicode scalar values.

This type is restricted to store a single Unicode [*scalar value*][1], typically encoding a single user-recognizable character. All valid Unicode scalar values are in the range(s) 0 to 0xD7FF and 0xE000 to 0x10FFFF, inclusive. This type guarantees that the stored integer value falls in these ranges.

[1]: https://www.unicode.org/glossary/#unicode_scalar_value

**Codepoints versus Scalar Values**

Formally, Unicode defines a codespace of values in the range 0 to 0x10FFFF inclusive, and a [Unicode codepoint](https://www.unicode.org/glossary/#code_point) is any integer falling within that range. However, due to historical reasons, it became necessary to "carve out" a subset of the codespace, excluding codepoints in the range 0xD800–0xDFFF. The subset of codepoints excluding that range is known as [Unicode scalar values][1]. The codepoints in the range 0xD800–0xDFFF are known as "surrogate" codepoints. The surrogate codepoints will never be assigned a semantic meaning, and can only validly appear in UTF-16 encoded text.

The difference between codepoints and scalar values is a technical distinction related to the backwards-compatible workaround chosen to enable UTF-16 to encode the full range of the Unicode codespace. For simplicity's sake, and to avoid a confusing clash with the Mojo `Scalar` type, this type is pragmatically named `Codepoint`, even though it is restricted to valid scalar values.

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, *, unsafe_unchecked_codepoint: SIMD[uint32, 1])`

Construct a `Codepoint` from a code point value without checking that it falls in the valid range.

Safety: The provided codepoint value MUST be a valid Unicode scalar value. Providing a value outside of the valid range could lead to undefined behavior in algorithms that depend on the validity guarantees of this type.

**Args:**

* unsafe\_unchecked\_codepoint (`SIMD[uint32, 1]`): A valid Unicode scalar value code point.

`__init__(out self, codepoint: SIMD[uint8, 1])`

Construct a `Codepoint` from a single byte value. This constructor cannot fail because non-negative 8-bit integers are valid Unicode scalar values.

**Args:**

* codepoint (`SIMD[uint8, 1]`): The 8-bit codepoint value to convert to a `Codepoint`.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Return True if this character has the same codepoint value as `other`.

**Args:**

* other (`Self`): The codepoint value to compare against.

**Returns:**

True if this character and `other` have the same codepoint value; False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Return True if this character has a different codepoint value from `other`.

**Args:**

* other (`Self`): The codepoint value to compare against.

**Returns:**

True if this character and `other` have different codepoint values; False otherwise.
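To see both comparison methods together, here's a small sketch that uses only APIs documented on this page:

```mojo
from collections.string import Codepoint
from testing import assert_true, assert_false

def main():
    var a = Codepoint.ord("a")
    # Equal codepoint values compare equal...
    assert_true(a == Codepoint.ord("a"))
    # ...and different codepoint values compare not-equal.
    assert_true(a != Codepoint.ord("b"))
    assert_false(a == Codepoint.ord("b"))
```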
### `from_u32`

`static from_u32(codepoint: SIMD[uint32, 1]) -> Optional[Codepoint]`

Construct a `Codepoint` from a code point value. Returns None if the provided `codepoint` is not in the valid range.

**Args:**

* codepoint (`SIMD[uint32, 1]`): An integer representing a Unicode scalar value.

**Returns:**

A `Codepoint` if `codepoint` falls in the valid range for Unicode scalar values, otherwise None.

### `ord`

`static ord(string: StringSlice[origin]) -> Self`

Returns the `Codepoint` that represents the given single-character string. Given a string containing one character, return a `Codepoint` representing the codepoint of that character. For example, `Codepoint.ord("a")` returns the codepoint `97`. This is the inverse of the `chr()` function. This function is similar to the `ord()` free function, except that it returns a `Codepoint` instead of an `Int`.

**Args:**

* string (`StringSlice[origin]`): The input string, which must contain only a single character.

**Returns:**

A `Codepoint` representing the codepoint of the given character.

### `unsafe_decode_utf8_codepoint`

`static unsafe_decode_utf8_codepoint(s: Span[SIMD[uint8, 1], origin]) -> Tuple[Codepoint, Int]`

Decodes a single `Codepoint` and the number of bytes read from a given UTF-8 encoded span.

Safety: `s` MUST point to the first byte in a **known-valid** UTF-8 character sequence. This function MUST NOT be used on unvalidated input.

**Args:**

* s (`Span[SIMD[uint8, 1], origin]`): Span to UTF-8 encoded data containing at least one valid encoded codepoint.

**Returns:**

The decoded codepoint `Codepoint`, as well as the number of bytes read.

### `__int__`

`__int__(self) -> Int`

Returns the numeric value of this scalar value as an integer.

**Returns:**

The numeric value of this scalar value as an integer.

### `__str__`

`__str__(self) -> String`

Formats this `Codepoint` as a single-character string.

**Returns:**

A string containing this single character.

### `is_ascii`

`is_ascii(self) -> Bool`

Returns True if this `Codepoint` is an ASCII character. All ASCII characters are less than or equal to codepoint value 127, and take exactly 1 byte to encode in UTF-8.

**Returns:**

A boolean indicating if this `Codepoint` is an ASCII character.

### `is_ascii_digit`

`is_ascii_digit(self) -> Bool`

Determines whether the given character is a digit [0-9].

**Returns:**

True if the character is a digit.

### `is_ascii_upper`

`is_ascii_upper(self) -> Bool`

Determines whether the given character is an uppercase character. This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "ABCDEFGHIJKLMNOPQRSTUVWXYZ".

**Returns:**

True if the character is uppercase.

### `is_ascii_lower`

`is_ascii_lower(self) -> Bool`

Determines whether the given character is a lowercase character. This currently only respects the default "C" locale, i.e. returns True iff the character specified is one of "abcdefghijklmnopqrstuvwxyz".

**Returns:**

True if the character is lowercase.

### `is_ascii_printable`

`is_ascii_printable(self) -> Bool`

Determines whether the given character is a printable character.

**Returns:**

True if the character is a printable character, otherwise False.

### `is_python_space`

`is_python_space(self) -> Bool`

Determines whether this character is a Python whitespace string. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines): `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`.
# Examples

Check if a string contains only whitespace:

```mojo
from collections.string import Codepoint
from testing import assert_true, assert_false

# ASCII space characters
assert_true(Codepoint.ord(" ").is_python_space())
assert_true(Codepoint.ord("\t").is_python_space())

# Unicode paragraph separator:
assert_true(Codepoint.from_u32(0x2029).value().is_python_space())

# Letters are not space characters
assert_false(Codepoint.ord("a").is_python_space())
```

**Returns:**

True if this character is one of the whitespace characters listed above, otherwise False.

### `is_posix_space`

`is_posix_space(self) -> Bool`

Returns True if this `Codepoint` is a **space** character according to the [POSIX locale][1]. The POSIX locale is also known as the C locale.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_01

This only respects the default "C" locale, i.e. returns True only if the character specified is one of " \t\n\v\f\r". For semantics similar to Python, use `String.isspace()`.

**Returns:**

True iff the character is one of the whitespace characters listed above.

### `to_u32`

`to_u32(self) -> SIMD[uint32, 1]`

Returns the numeric value of this scalar value as an unsigned 32-bit integer.

**Returns:**

The numeric value of this scalar value as an unsigned 32-bit integer.

### `unsafe_write_utf8`

`unsafe_write_utf8[optimize_ascii: Bool = True](self, ptr: UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, origin=origin]) -> UInt`

Writes this Unicode codepoint to `ptr` in its UTF-8 representation.

Safety: `ptr` MUST point to at least `self.utf8_byte_length()` allocated bytes or else an out-of-bounds write will occur, which is undefined behavior.

### Unicode (represented as UInt32 BE) to UTF-8 conversion:

* 1: `00000000 00000000 00000000 0aaaaaaa` -> `0aaaaaaa`
  * `a`
* 2: `00000000 00000000 00000aaa aabbbbbb` -> `110aaaaa 10bbbbbb`
  * `(a >> 6) | 0b11000000`, `b | 0b10000000`
* 3: `00000000 00000000 aaaabbbb bbcccccc` -> `1110aaaa 10bbbbbb 10cccccc`
  * `(a >> 12) | 0b11100000`, `(b >> 6) | 0b10000000`, `c | 0b10000000`
* 4: `00000000 000aaabb bbbbcccc ccdddddd` -> `11110aaa 10bbbbbb 10cccccc 10dddddd`
  * `(a >> 18) | 0b11110000`, `(b >> 12) | 0b10000000`, `(c >> 6) | 0b10000000`, `d | 0b10000000`

**Parameters:**

* optimize\_ascii (`Bool`): Optimize for languages with mostly ASCII characters.

**Args:**

* ptr (`UnsafePointer[SIMD[uint8, 1], address_space=address_space, alignment=alignment, origin=origin]`): Pointer value to write the encoded UTF-8 bytes. Must validly point to a sufficient number of bytes (1-4) to hold the encoded data.

**Returns:**

Returns the number of bytes written.

### `utf8_byte_length`

`utf8_byte_length(self) -> UInt`

Returns the number of UTF-8 bytes required to encode this character. The returned value is always between 1 and 4 bytes.

**Returns:**

Byte count of UTF-8 bytes required to encode this character.

---

## CodepointsIter

`struct CodepointsIter[mut: Bool, //, origin: Origin[mut]]`

Iterator over the `Codepoint`s in a string slice, constructed by `StringSlice.codepoints()`.

## Parameters

* mut (`Bool`): Mutability of the underlying string data.
* origin (`Origin[mut]`): Origin of the underlying string data.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Methods

### `__next__`

`__next__(mut self) -> Codepoint`

Get the next codepoint in the underlying string slice. This returns the next `Codepoint` encoded in the underlying string, and advances the iterator state. This function will abort if this iterator has been exhausted.
**Returns:**

The next character in the string.

### `__has_next__`

`__has_next__(self) -> Bool`

Returns True if there are still elements in this iterator.

**Returns:**

A boolean indicating if there are still elements in this iterator.

### `__len__`

`__len__(self) -> Int`

Returns the remaining length of this iterator in `Codepoint`s. The value returned from this method indicates the number of subsequent calls to `next()` that will return a value.

**Returns:**

Number of codepoints remaining in this iterator.

### `peek_next`

`peek_next(self) -> Optional[Codepoint]`

Check what the next codepoint in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value.

# Examples

`peek_next()` does not advance the iterator, so repeated calls will return the same value:

```mojo
from collections.string import Codepoint
from testing import assert_equal

var input = StringSlice("123")
var iter = input.codepoints()
assert_equal(iter.peek_next().value(), Codepoint.ord("1"))
assert_equal(iter.peek_next().value(), Codepoint.ord("1"))
assert_equal(iter.peek_next().value(), Codepoint.ord("1"))

# A call to `next()` returns the same value as `peek_next()` had,
# but also advances the iterator.
assert_equal(iter.next().value(), Codepoint.ord("1"))

# Later `peek_next()` calls will return the _new_ next character:
assert_equal(iter.peek_next().value(), Codepoint.ord("2"))
```

**Returns:**

The next character in the underlying string, or None if the string is empty.

### `next`

`next(mut self) -> Optional[Codepoint]`

Get the next codepoint in the underlying string slice, or None if the iterator is empty. This returns the next `Codepoint` encoded in the underlying string, and advances the iterator state.

**Returns:**

A character if the string is not empty, otherwise None.

---

## CodepointSliceIter

`struct CodepointSliceIter[mut: Bool, //, origin: Origin[mut], forward: Bool = True]`

Iterator for `StringSlice` over substring slices containing a single Unicode codepoint. The `forward` parameter only controls the behavior of the `__next__()` method used for normal iteration. Calls to `next()` will always take an element from the front of the iterator, and calls to `next_back()` will always take an element from the end.

## Parameters

* mut (`Bool`): Whether the slice is mutable.
* origin (`Origin[mut]`): The origin of the underlying string data.
* forward (`Bool`): The iteration direction. `False` is backwards.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Methods

### `__next__`

`__next__(mut self) -> StringSlice[origin]`

Get the next codepoint in the underlying string slice. This returns the next single-codepoint substring slice encoded in the underlying string, and advances the iterator state. If `forward` is set to `False`, this will return the next codepoint from the end of the string. This function will abort if this iterator has been exhausted.

**Returns:**

The next character in the string.

### `__has_next__`

`__has_next__(self) -> Bool`

Returns True if there are still elements in this iterator.

**Returns:**

A boolean indicating if there are still elements in this iterator.

### `__len__`

`__len__(self) -> Int`

Returns the remaining length of this iterator in `Codepoint`s. The value returned from this method indicates the number of subsequent calls to `next()` that will return a value.

**Returns:**

Number of codepoints remaining in this iterator.
### `peek_next`

`peek_next(self) -> Optional[StringSlice[origin]]`

Check what the next single-codepoint slice in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value.

# Examples

`peek_next()` does not advance the iterator, so repeated calls will return the same value:

```mojo
from collections.string import Codepoint
from testing import assert_equal

var input = StringSlice("123")
var iter = input.codepoint_slices()
assert_equal(iter.peek_next().value(), "1")
assert_equal(iter.peek_next().value(), "1")
assert_equal(iter.peek_next().value(), "1")

# A call to `next()` returns the same value as `peek_next()` had,
# but also advances the iterator.
assert_equal(iter.next().value(), "1")

# Later `peek_next()` calls will return the _new_ next character:
assert_equal(iter.peek_next().value(), "2")
```

**Returns:**

The next codepoint slice in the underlying string, or None if the string is empty.

### `peek_back`

`peek_back(mut self) -> Optional[StringSlice[origin]]`

Check what the last single-codepoint slice in this iterator is, without advancing the iterator state. Repeated calls to this method will return the same value.

# Examples

`peek_back()` does not advance the iterator, so repeated calls will return the same value:

```mojo
from collections.string import Codepoint
from testing import assert_equal

var input = StringSlice("123")
var iter = input.codepoint_slices()

# Repeated calls to `peek_back()` return the same value.
assert_equal(iter.peek_back().value(), "3")
assert_equal(iter.peek_back().value(), "3")
assert_equal(iter.peek_back().value(), "3")

# A call to `next_back()` returns the same value as `peek_back()` had,
# but also advances the iterator.
assert_equal(iter.next_back().value(), "3")

# Later `peek_back()` calls will return the _new_ last character:
assert_equal(iter.peek_back().value(), "2")
```

**Returns:**

The last codepoint slice in the underlying string, or None if the string is empty.

### `next`

`next(mut self) -> Optional[StringSlice[origin]]`

Get the next codepoint slice in the underlying string slice, or None if the iterator is empty. This returns the next single-codepoint substring encoded in the underlying string, and advances the iterator state.

**Returns:**

A character if the string is not empty, otherwise None.

### `next_back`

`next_back(mut self) -> Optional[StringSlice[origin]]`

Get the last single-codepoint slice in this iterator, or None if the iterator is empty. This returns the last codepoint slice in this iterator, and advances the iterator state.

**Returns:**

The last codepoint slice in the underlying string, or None if the string is empty.
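Putting `next()` and `next_back()` together, here's a short sketch that consumes a string from both ends, using only the methods documented above:

```mojo
from testing import assert_equal

def main():
    var input = StringSlice("abc")
    var iter = input.codepoint_slices()

    # Take one slice from the front and one from the back.
    assert_equal(iter.next().value(), "a")
    assert_equal(iter.next_back().value(), "c")

    # Only the middle codepoint remains.
    assert_equal(iter.next().value(), "b")
```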
* [​`interval`](/mojo/stdlib/collections/interval/): A self-balancing interval tree is a specialized binary search tree designed to efficiently store and query intervals. * [​`linked_list`](/mojo/stdlib/collections/linked_list/): * [​`list`](/mojo/stdlib/collections/list/): Defines the List type. * [​`optional`](/mojo/stdlib/collections/optional/): Defines Optional, a type modeling a value which may or may not be present. * [​`set`](/mojo/stdlib/collections/set/): Implements the Set datatype. --- ## comm The `gpu.comm` package provides communication primitives for GPUs. This package includes functions for sending and receiving data between GPUs, as well as for synchronizing threads across GPUs. ## Modules * [​`allgather`](/mojo/stdlib/gpu/comm/allgather/): Multi-GPU allgather implementation that gathers values from multiple GPUs into an output buffer. * [​`allreduce`](/mojo/stdlib/gpu/comm/allreduce/): Multi-GPU allreduce implementation for efficient tensor reduction across GPUs. --- ## Common ```c #include "max/c/common.h" ``` ## Functions ### `M_version()` > const char \*M\_version() Gets the MAX Engine version. * **Returns:** A string containing the semantic version of the MAX Engine. ### `M_newStatus()` > [M\_Status](types.md#_CPPv48M_Status) \*M\_newStatus() Creates a new status object. This is required as an argument for several functions, such as [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175) and [`M_compileModel()`](model.md#model_8h_1a88afca26a64b945885e1e1a0d09b5750). They will update the status object and you can check for errors with [`M_isError()`](#common_8h_1adb7a61f1c8f9c5e7964e8788cd437468) and get the status message with [`M_getError()`](#common_8h_1aa294beac43a0884cef8386e69a6bfc1b). For example: ```c M_Status *status = M_newStatus(); M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig(); M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status); if (M_isError(status)) { logError(M_getError(status)); return EXIT_FAILURE; } ``` * **Returns:** A pointer to the new status object. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeStatus()`](#common_8h_1ab5067fd51a5696b3679f7f629d3329c4). ### `M_getError()` > const char \*M\_getError(const [M\_Status](types.md#_CPPv48M_Status) \*status) Gets an error message from the `M_Status` parameter. You should call this only if [`M_isError()`](#common_8h_1adb7a61f1c8f9c5e7964e8788cd437468) is true. * **Parameters:** **status** – The status object for reporting errors and other messages. * **Returns:** A pointer to a null-terminated string containing the error message. ### `M_isError()` > int M\_isError(const [M\_Status](types.md#_CPPv48M_Status) \*status) Checks if status holds an error value. * **Parameters:** **status** – The status object for reporting errors and other messages. * **Returns:** `0` if there is no error, `1` otherwise. ### `M_freeStatus()` > void M\_freeStatus([M\_Status](types.md#_CPPv48M_Status) \*status) Deallocates the memory for the status object. No-op if `status` is `NULL`. * **Parameters:** **status** – The status object for reporting errors and other messages. ### `M_sizeOf()` > size\_t M\_sizeOf([M\_Dtype](types.md#_CPPv47M_Dtype) type) Gets the size (in bytes) of a data type. * **Parameters:** **type** – The data type. * **Returns:** Size in bytes of the given data type. If the data type is `M_UNKNOWN`, then `0`. 
### `M_getDynamicDimensionValue()` > int64\_t M\_getDynamicDimensionValue() Gets the value representing dynamic dimension. * **Returns:** Value representing dynamic dimension. ### `M_getDynamicRankValue()` > int64\_t M\_getDynamicRankValue() Gets the value representing dynamic rank. * **Returns:** Value representing dynamic rank. --- ## compact_order `compact_order(shape: IntTuple[origin], order: IntTuple[origin]) -> IntTuple` Create a compact stride based on shape and order. This function generates a stride tuple where lower order numbers imply faster varying strides. The resulting shape and stride form a bijective layout. Performance: * Always inlined for optimal performance in tight loops. * Flattens inputs and re-nests results for consistent behavior. Example: ```mojo from layout import IntTuple from layout.int_tuple import compact_order # Create a compact layout with dimensions (2,3,4,5) and ordering (1,4,3,5) var x = compact_order(IntTuple(2,3,4,5), IntTuple(1,4,3,5)) # returns (1,8,2,24) # Create a compact layout with nested dimensions and corresponding ordering var y = compact_order(IntTuple(2,IntTuple(3,4),5), IntTuple(1,IntTuple(2,3),4)) # returns (1,(2,6),24) ``` . **Args:** * ​shape (`IntTuple[origin]`): The shape tuple defining dimensions. * ​order (`IntTuple[origin]`): The order tuple defining the relative ordering of dimensions. **Returns:** A stride tuple that creates a compact memory layout according to the specified order. --- ## comparable ## Traits * [​`Comparable`](/mojo/stdlib/builtin/comparable/Comparable): A type which can be compared with other instances of itself. * [​`GreaterThanComparable`](/mojo/stdlib/builtin/comparable/GreaterThanComparable): A type which can be greater than compared with other instances of itself. * [​`GreaterThanOrEqualComparable`](/mojo/stdlib/builtin/comparable/GreaterThanOrEqualComparable): A type which can be greater than or equal to compared with other instances of itself. * [​`LessThanComparable`](/mojo/stdlib/builtin/comparable/LessThanComparable): A type which can be less than compared with other instances of itself. * [​`LessThanOrEqualComparable`](/mojo/stdlib/builtin/comparable/LessThanOrEqualComparable): A type which can be less than or equal to compared with other instances of itself. --- ## Comparable A type which can be compared with other instances of itself. ## Implemented traits `AnyType`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `LessThanComparable`, `LessThanOrEqualComparable`, `UnknownDestructibility` ## Methods ### `__lt__` `__lt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than `rhs`. ### `__le__` `__le__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than or equal to `rhs`. ### `__eq__` `__eq__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are equal according to the type's definition of equality, False otherwise. ### `__ne__` `__ne__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are not equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. 
**Returns:**

True if the instances are not equal according to the type's definition of equality, False otherwise.

### `__gt__`

`__gt__(self: _Self, rhs: _Self) -> Bool`

Define whether `self` is greater than `rhs`.

**Args:**

* rhs (`_Self`): The right hand side of the comparison.

**Returns:**

True if `self` is greater than `rhs`.

### `__ge__`

`__ge__(self: _Self, rhs: _Self) -> Bool`

Define whether `self` is greater than or equal to `rhs`.

**Args:**

* rhs (`_Self`): The right hand side of the comparison.

**Returns:**

True if `self` is greater than or equal to `rhs`.

---

## compatible

`compatible(a: IntTuple[origin], b: IntTuple[origin]) -> Bool`

Test if two shapes are compatible for tensor operations. This function checks if shape A is compatible with shape B, meaning:

1. The total size of A and B are the same.
2. Any coordinate into A can also be used as a coordinate into B.

Compatible can also be thought of as a partial order on A and B: A <= B.

**Args:**

* a (`IntTuple[origin]`): The first `IntTuple` to compare.
* b (`IntTuple[origin]`): The second `IntTuple` to compare.

**Returns:**

True if shape A is compatible with shape B, False otherwise.

---

## CompilationTarget

`@register_passable(trivial)`

`struct CompilationTarget[value: target = _current_target()]`

A struct that provides information about a target architecture. This struct encapsulates various methods to query target-specific information such as architecture features, OS details, endianness, and memory characteristics.

## Parameters

* value (`target`): The target architecture to query. Defaults to the current target.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `has_sse4`

`static has_sse4() -> Bool`

Checks if the target supports SSE4 instructions.

**Returns:**

True if the target supports SSE4, False otherwise.

### `is_x86`

`static is_x86() -> Bool`

Checks if the target is an x86 architecture.

**Returns:**

True if the target is x86, False otherwise.

---

## compile

Provides utilities for compiling and inspecting Mojo code. This module contains functionality for compiling Mojo functions and examining their assembly, LLVM IR, or object code output. It is particularly useful for kernel engineers who want to inspect the low-level implementation details of specific functions without dealing with entire files or manual invocation of compilation tools.

Key features:

* Compile individual functions to assembly, LLVM IR, or object code
* Get linkage names and module information
* Inspect number of captures and other function metadata
* Write compilation output to files
* Control compilation options and targets

Example:

```mojo
from compile import compile_info

fn my_func(x: Int) -> Int:
    return x

# Get assembly for the function
info = compile_info[my_func]()
print(info)
```

## Structs

* [​`Info`](/mojo/stdlib/compile/compile/Info): Contains compilation information and results for a function.

## Functions

* [​`compile_info`](/mojo/stdlib/compile/compile/compile_info): Compiles a function and returns detailed compilation information.

---

## compile

Provides utilities for compiling and inspecting Mojo code at runtime. This module exposes functionality for compiling individual Mojo functions and examining their low-level implementation details.
It is particularly useful for: * Inspecting assembly, LLVM IR, or object code output * Getting linkage names and module information * Examining function metadata like captures * Writing compilation output to files * Controlling compilation options and targets Example: ```mojo from compile import compile_info fn my_func(): print("Hello") # Get assembly for the function info = compile_info[my_func]() print(info.asm) ``` ## Modules * [​`compile`](/mojo/stdlib/compile/compile/): Provides utilities for compiling and inspecting Mojo code. * [​`reflection`](/mojo/stdlib/compile/reflection/): --- ## compile Implements functions that return compile-time information. ## Aliases ### `DebugLevel` `alias DebugLevel = _DebugLevel()` Represents the debug level used during compilation. ### `OptimizationLevel` `alias OptimizationLevel = _OptimizationLevel()` Represents the optimization level used during compilation. ## Functions * [​`is_compile_time`](/mojo/stdlib/sys/compile/is_compile_time): Returns true if the current code is executed at compile time, false otherwise. --- ## compile_info `compile_info[func_type: AnyTrivialRegType, //, func: func_type, /, *, emission_kind: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("asm"), compile_options: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: target = _current_target()]() -> Info[func_type, func, target]` Compiles a function and returns detailed compilation information. This function takes a Mojo function and compiles it, providing access to the generated assembly code, linkage information, and other compilation artifacts. It can be used for inspection, debugging, and low-level optimization. Example: ```mojo from compile import compile_info fn my_func(x: Int) -> Int: return x info = compile_info[my_func]() print(info) # Print assembly ``` Note: The compilation is always performed, even if the function is not used. For performance-critical code, consider caching the compilation results. **Parameters:** * ​func\_type (`AnyTrivialRegType`): Type of the function to compile. Must be a trivially-copyable register type. * ​func (`func_type`): The function to compile. Must match the specified func\_type. * ​emission\_kind (`StringSlice[StaticConstantOrigin]`): The desired output format. Valid options are: * "asm": Assembly code (default). * "llvm": Unoptimized LLVM IR. * "llvm-opt": Optimized LLVM IR. * "object": Object code. * ​compile\_options (`StringSlice[StaticConstantOrigin]`): Additional compiler flags and options as a string. * ​target (`target`): The target architecture to compile for. Defaults to current architecture. **Returns:** An `Info` struct containing: * asm: The generated code in the requested format * linkage\_name: The mangled function name for linking * module\_hash: A unique hash of the compiled module * num\_captures: Number of captured variables * error: Any error message (empty if successful) * failed: Boolean indicating if compilation failed --- ## compiler ## Functions * [​`keep`](/mojo/stdlib/benchmark/compiler/keep): Provides a hint to the compiler to not optimize the variable use away. --- ## complement `complement(layout: Layout, size: Int = 1) -> Layout` Computes the complement layout for a given layout. This function creates a layout that represents the "gaps" or complementary structure of the input layout. It's useful for creating hierarchical layouts where you need to fill in the spaces between existing layout elements. 
Example: ```mojo from layout import Layout, IntTuple from layout.layout import complement # Compute the complement of a layout var base = Layout(IntTuple(2, 3), IntTuple(3, 1)) var comp = complement(base, 10) # Result: A layout that fills the gaps in the original layout ``` . **Args:** * ​layout (`Layout`): The input layout to compute the complement for. * ​size (`Int`): The total size of the memory region to consider. Defaults to 1. **Returns:** A new layout representing the complement of the input layout. --- ## complex Implements the Complex type. You can import these APIs from the `complex` package. For example: ```mojo from complex import ComplexSIMD ``` ## Aliases ### `ComplexFloat32` `alias ComplexFloat32 = ComplexSIMD[float32, 1]` ### `ComplexFloat64` `alias ComplexFloat64 = ComplexSIMD[float64, 1]` ## Structs * [​`ComplexSIMD`](/mojo/stdlib/complex/complex/ComplexSIMD): Represents a complex SIMD value. ## Functions * [​`abs`](/mojo/stdlib/complex/complex/abs): Performs elementwise abs (norm) on each element of the complex value. --- ## complex Provides types and functions for working with complex numbers. ## Modules * [​`complex`](/mojo/stdlib/complex/complex/): Implements the Complex type. --- ## ComplexSIMD `@register_passable(trivial)` `struct ComplexSIMD[type: DType, size: Int]` Represents a complex SIMD value. The class provides basic methods for manipulating complex values. ## Parameters * ​type (`DType`): DType of the value. * ​size (`Int`): SIMD width of the value. ## Fields * ​re (`SIMD[type, size]`): The real part of the complex SIMD value. * ​im (`SIMD[type, size]`): The imaginary part of the complex SIMD value. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `element_type` `alias element_type = SIMD[type, size]` ## Methods ### `__init__` `__init__(re: SIMD[type, size], im: SIMD[type, size] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initializes a complex SIMD value. **Args:** * ​re (`SIMD[type, size]`): The real part of the complex value. * ​im (`SIMD[type, size]`): The imaginary part of the complex value. ### `__neg__` `__neg__(self) -> Self` Negates the complex value. **Returns:** The negative of the complex value. ### `__add__` `__add__(self, rhs: Self) -> Self` Adds two complex values. **Args:** * ​rhs (`Self`): Complex value to add. **Returns:** A sum of this and RHS complex values. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Subtracts two complex values. **Args:** * ​rhs (`Self`): Complex value to subtract. **Returns:** A difference of this and RHS complex values. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Multiplies two complex values. **Args:** * ​rhs (`Self`): Complex value to multiply with. **Returns:** A product of this and RHS complex values. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Divides two complex values. **Args:** * ​rhs (`Self`): Complex value to divide by. **Returns:** A quotient of this and RHS complex values. ### `__str__` `__str__(self) -> String` Get the complex as a string. **Returns:** A string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this complex value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__abs__` `__abs__(self) -> SIMD[type, size]` Returns the magnitude of the complex value. **Returns:** Value of `sqrt(re*re + im*im)`. 
### `norm`

`norm(self) -> SIMD[type, size]`

Returns the magnitude of the complex value.

**Returns:**

Value of `sqrt(re*re + im*im)`.

### `squared_norm`

`squared_norm(self) -> SIMD[type, size]`

Returns the squared magnitude of the complex value.

**Returns:**

Value of `re*re + im*im`.

### `fma`

`fma(self, b: Self, c: Self) -> Self`

Computes the FMA operation, a fused multiply-add with two other complex values: `result = self * b + c`

**Args:**

* b (`Self`): Multiplier complex value.
* c (`Self`): Complex value to add.

**Returns:**

Computed `Self * B + C` complex value.

### `squared_add`

`squared_add(self, c: Self) -> Self`

Computes the Square-Add operation: `Self * Self + C`.

**Args:**

* c (`Self`): Complex value to add.

**Returns:**

Computed `Self * Self + C` complex value.

### `__exp__`

`__exp__(self) -> Self`

Computes the exponential of the complex value.

**Returns:**

The exponential of the complex value.

---

## ComposedLayout

`struct ComposedLayout[LayoutA: LayoutTrait, LayoutB: LayoutTrait, offset: OptionalReg[Int] = OptionalReg[Int]({:@stdlib::@builtin::@int::@Int {0}, 0})]`

Layout composed of two layouts applied sequentially. Combines two layouts. Output of the first (`LayoutA`) is input to the second (`LayoutB`), with optional offset in between.

## Parameters

* LayoutA (`LayoutTrait`): The first layout to apply.
* LayoutB (`LayoutTrait`): The second layout to apply.
* offset (`OptionalReg[Int]`): Optional offset between layouts (default: 0).

## Fields

* layout\_a (`LayoutA`): The first layout to apply.
* layout\_b (`LayoutB`): The second layout to apply.

## Implemented traits

`AnyType`, `Copyable`, `LayoutTrait`, `UnknownDestructibility`

## Aliases

### `has_shape`

`alias has_shape = get_vtable_entry(:trait LayoutA, "has_shape") if get_vtable_entry(:trait LayoutA, "has_shape") else get_vtable_entry(:trait LayoutB, "has_shape")`

True if either layout has a shape.

## Methods

### `__init__`

`__init__(out self, layout_a: LayoutA, layout_b: LayoutB)`

Initialize ComposedLayout with two layouts.

**Args:**

* layout\_a (`LayoutA`): The first layout.
* layout\_b (`LayoutB`): The second layout.

### `__copyinit__`

`__copyinit__(out self, other: Self)`

Copy constructor for ComposedLayout.

**Args:**

* other (`Self`): The ComposedLayout to copy from.

### `__call__`

`__call__(self, idx: IntTuple[origin]) -> Int`

Apply composed layout to an index. Applies `LayoutA`, then adds offset, then applies `LayoutB`.

**Args:**

* idx (`IntTuple[origin]`): The index to transform.

**Returns:**

The transformed index.

`__call__(self, idx: IntTuple[origin], offset_val: Int) -> Int`

Apply composed layout with runtime offset. Applies `LayoutA`, then adds runtime `offset_val`, then `LayoutB`. Static offset must not be set when using runtime offset.

**Args:**

* idx (`IntTuple[origin]`): The index to transform.
* offset\_val (`Int`): Runtime offset to apply.

**Returns:**

The transformed index.

### `size`

`size(self) -> Int`

Get the size of the composed layout. Returns the size of the first layout (`LayoutA`).

**Returns:**

The size of the first layout.

### `cosize`

`cosize(self) -> Int`

Get the cosize of the composed layout. Returns the cosize of the second layout (`LayoutB`).

**Returns:**

The cosize of the second layout.

---

## composition

`composition(layout_a: Layout, layout_b: Layout) -> Layout`

Composes two layouts to create a new layout. This function creates a new layout by composing two layouts, where the first layout defines the outer structure and the second layout defines the inner structure.
The new layout is compatible with `layout_b` (that is, it has the same `size` and every set of coordinates in `layout_b` has an equivalent in the new layout). You can think of `layout_b` as selecting a subset of elements from `layout_a`. Example: ```mojo from layout.layout import Layout, IntTuple from layout.layout import composition # Compose a row-major layout with a tiling layout var base = Layout.row_major(6, 8) var tiling = Layout(IntTuple(3, 2), IntTuple(1, 3)) var composed = composition(base, tiling) # Result: A layout that represents a 3x2 tile from # layout_a ``` . **Args:** * ​layout\_a (`Layout`): The outer layout. * ​layout\_b (`Layout`): The inner layout. **Returns:** A new layout representing the composition of the two layouts. `composition(layout_a: Layout, tiler: List[Layout]) -> Layout` Composes a layout with a list of layouts to create a hierarchical layout. This function creates a new layout by composing each element of the first layout with the corresponding element in the tiler list. If the tiler list is shorter than the layout, the remaining elements from the layout are appended unchanged. Example: ```mojo from layout import Layout, LayoutList, IntTuple from layout.layout import composition # Compose a layout with a list of tiling layouts var base = Layout.row_major(6, 8) var tilers = LayoutList() tilers.append(Layout(IntTuple(2, 2), IntTuple(1, 2))) tilers.append(Layout(IntTuple(3, 3), IntTuple(1, 3))) var composed = composition(base, tilers) # Result: A layout with hierarchical tiling based on the tiler list ``` . **Args:** * ​layout\_a (`Layout`): The base layout to compose with the tiler. * ​tiler (`List[Layout]`): A list of layouts to compose with the base layout. **Returns:** A new layout representing the composition of the base layout with the tiler. --- ## compressed_store `compressed_store[dtype: DType, size: Int](value: SIMD[dtype, size], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], mask: SIMD[bool, size])` Compresses the lanes of `value`, skipping `mask` lanes, and stores at `addr`. **Parameters:** * ​dtype (`DType`): DType of `value`, the value to store. * ​size (`Int`): Size of `value`, the value to store. **Args:** * ​value (`SIMD[dtype, size]`): The vector containing data to store. * ​addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The memory location to store the compressed data. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of `value`. --- ## concat `concat(owned lhs: IntTuple[origin], rhs: IntTuple[origin]) -> IntTuple` Concatenates two `IntTuple` instances into a single `IntTuple`. This function appends all elements from the right-hand side tuple to the left-hand side tuple, creating a new combined tuple. The operation preserves the hierarchical structure of both tuples. **Args:** * ​lhs (`IntTuple[origin]`): The left-hand side `IntTuple` that will be modified (owned parameter). * ​rhs (`IntTuple[origin]`): The right-hand side `IntTuple` whose elements will be appended. **Returns:** A new `IntTuple` containing all elements from both tuples in sequence. 
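As a small illustration, here's a sketch; the import path for `concat` is assumed to match the other `layout.int_tuple` utilities shown elsewhere in this document (such as `compact_order`):

```mojo
from layout import IntTuple
from layout.int_tuple import concat  # assumed import path

def main():
    var lhs = IntTuple(2, 3)
    var rhs = IntTuple(4, IntTuple(5, 6))
    # Appends rhs's elements after lhs's, preserving nesting:
    # concatenated is (2, 3, 4, (5, 6))
    var concatenated = concat(lhs, rhs)
```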
--- ## concat `concat[rank: Int, type: DType, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None] = OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]({:i1 0, 1})](output: NDBuffer[type, rank, origin], axis: Int, inputs: StaticTuple[NDBuffer[type, rank, MutableAnyOrigin], size], context: DeviceContextPtr = DeviceContextPtr())` --- ## concat ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None` ## Functions * [​`concat`](./concat): * [​`concat_shape`](./concat_shape): Compute the output shape of a `concat` operation, and assert the inputs are compatible. * [​`fused_concat`](./fused_concat): * [​`memcpy_or_fuse`](./memcpy_or_fuse): --- ## concat_shape `concat_shape[input_rank: Int, input_type: DType, single_thread_blocking_override: Bool](input_bufs: List[NDBuffer[input_type, input_rank, MutableAnyOrigin]], axis: Int) -> IndexList[input_rank]` Compute the output shape of a `concat` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensors. * ​input\_type (`DType`): Type of the input tensors. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_bufs (`List[NDBuffer[input_type, input_rank, MutableAnyOrigin]]`): The list of input tensors. * ​axis (`Int`): The axis along which to concatenate. **Returns:** The output shape. --- ## config Standardized configuration for Pipeline Inference. ## `AudioGenerationConfig` {#max.pipelines.lib.config.AudioGenerationConfig} > *class* max.pipelines.lib.config.AudioGenerationConfig(audio\_config: 'dict\[str, str]', \*\*kwargs: 'Any') **Parameters:** * **audio\_config** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `]` ) * **kwargs** (`Any` ) ### `audio_decoder` {#max.pipelines.lib.config.AudioGenerationConfig.audio_decoder} > audio\_decoder\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* *= ''* The name of the audio decoder model architecture. ### `audio_decoder_weights` {#max.pipelines.lib.config.AudioGenerationConfig.audio_decoder_weights} > audio\_decoder\_weights\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* *= ''* The path to the audio decoder weights file. ### `audio_prompt_speakers` {#max.pipelines.lib.config.AudioGenerationConfig.audio_prompt_speakers} > audio\_prompt\_speakers\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* *= ''* The path to the audio prompt speakers file. ## `PipelineConfig` {#max.pipelines.lib.config.PipelineConfig} > *class* max.pipelines.lib.config.PipelineConfig(\*\*kwargs) Configuration for a pipeline. WIP - Once a PipelineConfig is fully initialized, it should be as immutable as possible (frozen=True). All underlying dataclass fields should have been initialized to their default values, whether user-specified via a CLI flag, config file, or environment variable, or set internally to a reasonable default.
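As a rough illustration, a config might be constructed from keyword arguments and then validated. This is a hypothetical sketch, not a verbatim recipe: the field names follow the fields documented below, and the import path may differ across MAX versions:

```python
from max.pipelines.lib.config import PipelineConfig  # path may vary by version

# Construct from keyword arguments; unset fields keep their defaults.
config = PipelineConfig(
    max_batch_size=32,            # raise above the local-serving default
    enable_chunked_prefill=True,  # split long context encodings into chunks
)
config.resolve()  # validate and finalize all fields
```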
**Parameters:** **kwargs** (`Any` ) ### `custom_architectures` {#max.pipelines.lib.config.PipelineConfig.custom_architectures} > custom\_architectures\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]\* A list of custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. For example: * `my_module` * `folder/path/to/import:my_module` Each module must expose an `ARCHITECTURES` list of architectures to register. ### `draft_model_config` {#max.pipelines.lib.config.PipelineConfig.draft_model_config} > *property* draft\_model\_config\*: MAXModelConfig | [None](https://docs.python.org/3/library/constants.html#None)\* ### `enable_chunked_prefill` {#max.pipelines.lib.config.PipelineConfig.enable_chunked_prefill} > enable\_chunked\_prefill\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= True* Enable chunked prefill to split context encoding requests into multiple chunks based on `target_num_new_tokens`. ### `enable_echo` {#max.pipelines.lib.config.PipelineConfig.enable_echo} > enable\_echo\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* Whether the model should be built with echo capabilities. ### `enable_in_flight_batching` {#max.pipelines.lib.config.PipelineConfig.enable_in_flight_batching} > enable\_in\_flight\_batching\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* When enabled, prioritizes token generation by batching it with context encoding requests. ### `engine` {#max.pipelines.lib.config.PipelineConfig.engine} > engine\*: PipelineEngine | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* Engine backend to use for serving: `max` for the MAX engine, or `huggingface` as a fallback option for improved model coverage. ### `graph_quantization_encoding` {#max.pipelines.lib.config.PipelineConfig.graph_quantization_encoding} > *property* graph\_quantization\_encoding\*: [QuantizationEncoding](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) | [None](https://docs.python.org/3/library/constants.html#None)\* Converts the CLI encoding to a MAX graph quantization encoding. **Returns:** The graph quantization encoding corresponding to the CLI encoding. ### `help()` {#max.pipelines.lib.config.PipelineConfig.help} > *static* help() Documentation for this config class. Return a dictionary of config options and their descriptions. **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [str](https://docs.python.org/3/library/stdtypes.html#str)] ### `ignore_eos` {#max.pipelines.lib.config.PipelineConfig.ignore_eos} > ignore\_eos\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* Ignore EOS and continue generating tokens, even when an EOS token is hit. ### `max_batch_size` {#max.pipelines.lib.config.PipelineConfig.max_batch_size} > max\_batch\_size\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* Maximum batch size to execute with the model. This is set to one to minimize memory consumption for the base case, in which a user is running a local server to test out MAX. For users launching in a server scenario, the expectation is that this value should be set higher based on server capacity.
### `max_ce_batch_size` {#max.pipelines.lib.config.PipelineConfig.max_ce_batch_size} > max\_ce\_batch\_size\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 192* Maximum cache size to reserve for a single context encoding batch. The actual limit is the lesser of this and max\_batch\_size. ### `max_length` {#max.pipelines.lib.config.PipelineConfig.max_length} > max\_length\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* Maximum sequence length of the model. ### `max_new_tokens` {#max.pipelines.lib.config.PipelineConfig.max_new_tokens} > max\_new\_tokens\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= -1* Maximum number of new tokens to generate during a single inference pass of the model. ### `max_num_steps` {#max.pipelines.lib.config.PipelineConfig.max_num_steps} > max\_num\_steps\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= -1* The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models). ### `model_config` {#max.pipelines.lib.config.PipelineConfig.model_config} > *property* model\_config\*: MAXModelConfig\* ### `pad_to_multiple_of` {#max.pipelines.lib.config.PipelineConfig.pad_to_multiple_of} > pad\_to\_multiple\_of\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 2* Pad input tensors to a multiple of the provided value. ### `pdl_level` {#max.pipelines.lib.config.PipelineConfig.pdl_level} > pdl\_level\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* *= '1'* Level of overlap of kernel launch via programmatic dependent grid control. ### `pipeline_role` {#max.pipelines.lib.config.PipelineConfig.pipeline_role} > pipeline\_role\*: PipelineRole\* *= 'prefill\_and\_decode'* Whether the pipeline should serve a prefill role, a decode role, or both. ### `pool_embeddings` {#max.pipelines.lib.config.PipelineConfig.pool_embeddings} > pool\_embeddings\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= True* Whether to pool embedding outputs. ### `profiling_config` {#max.pipelines.lib.config.PipelineConfig.profiling_config} > *property* profiling\_config\*: ProfilingConfig\* ### `resolve()` {#max.pipelines.lib.config.PipelineConfig.resolve} > resolve() Validates and resolves the config. This method is called after the config is initialized, to ensure that all config fields have been initialized to a valid state. **Return type:** None ### `sampling_config` {#max.pipelines.lib.config.PipelineConfig.sampling_config} > *property* sampling\_config\*: SamplingConfig\* ### `target_num_new_tokens` {#max.pipelines.lib.config.PipelineConfig.target_num_new_tokens} > target\_num\_new\_tokens\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The target number of un-encoded tokens to include in each batch. If not set, this will be set to a best-guess optimal value based on model, hardware, and available memory.
### `use_experimental_kernels` {#max.pipelines.lib.config.PipelineConfig.use_experimental_kernels} > use\_experimental\_kernels\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* *= 'false'* --- ## config_in_smem `config_in_smem[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, //, max_smem: Int](config: MatmulConfig[a_type, b_type, c_type, transpose_b]) -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## congruent `congruent(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Test if two `IntTuple`s have the same hierarchical structure. This function checks if two `IntTuple`s have identical nesting patterns, regardless of the actual integer values they contain. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple` to compare. * ​b (`IntTuple[origin]`): Second `IntTuple` to compare. **Returns:** True if both `IntTuple`s have the same hierarchical structure, False otherwise. --- ## Consistency `@register_passable(trivial)` `struct Consistency` Represents memory consistency models for GPU memory operations. This struct defines different memory consistency levels that control how memory operations are ordered and synchronized between threads. The consistency model affects both performance and correctness of parallel algorithms. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ACQUIRE` `alias ACQUIRE = Consistency(2)` Acquire consistency for synchronization operations. Ensures all subsequent memory operations are ordered after this operation. Used in producer-consumer patterns. ### `RELAXED` `alias RELAXED = Consistency(1)` Relaxed consistency with basic ordering guarantees. Provides some ordering guarantees while still allowing optimizations. Suitable for operations that don't require strict ordering. ### `RELEASE` `alias RELEASE = Consistency(3)` Release consistency for synchronization operations. Ensures all previous memory operations are ordered before this operation. Paired with acquire operations for synchronization. ### `WEAK` `alias WEAK = Consistency(0)` Weakest consistency model with minimal ordering guarantees. Provides maximum flexibility for hardware/compiler optimizations but requires careful synchronization by the programmer. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two Consistency instances are equal. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two Consistency instances are not equal. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two Consistency instances are identical. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two Consistency instances are not identical. **Args:** * ​other (`Self`): The Consistency instance to compare against. **Returns:** True if the consistency levels are not identical, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the consistency level. **Returns:** A string describing the consistency level. 
### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the mnemonic string for the consistency level. **Returns:** A string literal containing the consistency level mnemonic. --- ## Consistency `@register_passable(trivial)` `struct Consistency` Represents the consistency model for atomic operations. The class provides a set of constants that represent different consistency models for atomic operations. Attributes:

* `NOT_ATOMIC`: Not atomic.
* `UNORDERED`: Unordered.
* `MONOTONIC`: Monotonic.
* `ACQUIRE`: Acquire.
* `RELEASE`: Release.
* `ACQUIRE_RELEASE`: Acquire-release.
* `SEQUENTIAL`: Sequentially consistent.

## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ACQUIRE` `alias ACQUIRE = Consistency(__init__[__mlir_type.!pop.int_literal](3))` Acquire. ### `ACQUIRE_RELEASE` `alias ACQUIRE_RELEASE = Consistency(__init__[__mlir_type.!pop.int_literal](5))` Acquire-release. ### `MONOTONIC` `alias MONOTONIC = Consistency(__init__[__mlir_type.!pop.int_literal](2))` Monotonic. ### `NOT_ATOMIC` `alias NOT_ATOMIC = Consistency(__init__[__mlir_type.!pop.int_literal](0))` Not atomic. ### `RELEASE` `alias RELEASE = Consistency(__init__[__mlir_type.!pop.int_literal](4))` Release. ### `SEQUENTIAL` `alias SEQUENTIAL = Consistency(__init__[__mlir_type.!pop.int_literal](6))` Sequentially consistent. ### `UNORDERED` `alias UNORDERED = Consistency(__init__[__mlir_type.!pop.int_literal](1))` Unordered. ## Methods ### `__init__` `__init__(value: SIMD[uint8, 1]) -> Self` Constructs a new Consistency object. **Args:** * ​value (`SIMD[uint8, 1]`): The value of the consistency model. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compares two Consistency objects for equality. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compares two Consistency objects for inequality. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if the Consistency object is the same as another. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are the same, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if the Consistency object is not the same as another. **Args:** * ​other (`Self`): The other Consistency object to compare with. **Returns:** True if the objects are not the same, False otherwise. ### `__mlir_attr` `__mlir_attr(self) -> !kgen.deferred` Returns the MLIR attribute representation of the Consistency object. **Returns:** The MLIR attribute representation of the Consistency object. --- ## constant_memory_mapping This module provides functionality for mapping constant memory between host and device. The module includes the `ConstantMemoryMapping` struct which represents a mapping of constant memory that can be used for efficient data transfer between host and GPU device. ## Structs * [​`ConstantMemoryMapping`](/mojo/stdlib/gpu/host/constant_memory_mapping/ConstantMemoryMapping): Represents a mapping of constant memory between host and device. --- ## ConstantMemoryMapping `@register_passable(trivial)` `struct ConstantMemoryMapping` Represents a mapping of constant memory between host and device. This struct encapsulates the information needed to manage constant memory that can be accessed by GPU kernels.
Constant memory provides a fast, read-only cache accessible by all threads on the GPU device. Attributes:

* `name`: A string identifier for the constant memory mapping.
* `ptr`: Pointer to the memory location.
* `byte_count`: Size of the memory mapping in bytes.

## Fields * ​name (`StringSlice[StaticConstantOrigin]`): A string identifier for the constant memory mapping. This name is used to uniquely identify the constant memory region in the GPU programming model, allowing the runtime to properly associate the memory with kernel references to constant memory symbols. * ​ptr (`UnsafePointer[NoneType]`): Pointer to the host memory location that will be mapped to device constant memory. This raw pointer represents the starting address of the memory region that will be accessible as constant memory on the GPU. The memory should remain valid for the lifetime of any kernels that access it. * ​byte\_count (`Int`): Size of the memory mapping in bytes. Specifies the total size of the constant memory region. This value is used by the runtime to determine how much data to transfer between host and device. The size must be sufficient to hold all data needed by GPU kernels. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## constants Defines math utilities. You can import these APIs from the `math` package. For example:

```mojo
from math import pi
```

## Aliases ### `e` `alias e = 2.7182818284590451` The Euler constant e = 2.718281... ### `log2e` `alias log2e = 1.4426950408889634` log2e = log2(e), where e is Euler's constant. ### `pi` `alias pi = 3.1415926535897931` The mathematical constant π = 3.141592... ### `tau` `alias tau = 6.2831853071795862` The mathematical constant τ = 6.283185.... Tau is the circumference of a unit circle (2π). --- ## constrained `constrained[cond: Bool, msg: StringSlice[StaticConstantOrigin], *extra: StringSlice[StaticConstantOrigin]]()` Asserts that the condition must be true at compile time. The `constrained()` function introduces a compile-time constraint on the enclosing function. If the condition is true at compile time, the constraint has no effect. If the condition is false, compilation fails and the message is displayed. This is similar to `static_assert` in C++. It differs from [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert), which is a run-time assertion. Example:

```mojo
fn half[dtype: DType](a: Scalar[dtype]) -> Scalar[dtype]:
    constrained[
        dtype.is_numeric(), "dtype must be numeric."
    ]()
    return a / 2

def main():
    print(half(UInt8(5)))  # prints 2
    print(half(Scalar[DType.bool](True)))  # constraint failed:
                                           # dtype must be numeric.
```

**Parameters:** * ​cond (`Bool`): The bool value to assert. * ​msg (`StringSlice[StaticConstantOrigin]`): The message to display on failure. * ​\*extra (`StringSlice[StaticConstantOrigin]`): Additional messages to concatenate to msg. `constrained[cond: Bool]()` Asserts that the condition must be true at compile time. The `constrained()` function introduces a compile-time constraint on the enclosing function. If the condition is true at compile time, the constraint has no effect. If the condition is false, compilation fails and a generic message is displayed. This is similar to `static_assert` in C++. It differs from [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert), which is a run-time assertion. For an example, see the [first overload](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​cond (`Bool`): The bool value to assert.
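To illustrate the variadic `*extra` message parameters of the first overload, here is a minimal sketch (`scale_down` is a hypothetical function, not a library API):

```mojo
fn scale_down[factor: Int](x: Int) -> Int:
    # Fails compilation with the concatenated message when factor == 0.
    constrained[factor != 0, "factor must be non-zero; ", "got factor == 0"]()
    return x // factor

def main():
    print(scale_down[4](12))  # prints 3
    # scale_down[0](12) would fail to compile with the message above.
```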
--- ## constrained Implements compile-time constraints. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`constrained`](/mojo/stdlib/builtin/constrained/constrained): Asserts that the condition must be true at compile time. --- ## consumer_main_loop `consumer_main_loop[accum_type: DType, a_type: DType, b_type: DType, c_reg_layout: Layout, a_smem_layout: Layout, b_smem_layout: Layout, wgmma_shape: IndexList[3], a_swizzle: TensorMapSwizzle, b_swizzle: TensorMapSwizzle, transpose_b: Bool, pipeline_stages: Int, /, *, cluster_shape: StaticTuple[SIMD[int32, 1], 3] = StaticTuple(__init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1)), promotion_frequency: Int = 1, num_consumer: Int = 1](final_c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], a_smem_iter: LayoutTensorIter[a_type, a_smem_layout, origin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, origin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut read_pipeline_states: PipelineState[pipeline_stages], full: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], empty: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], wgmma_op: TensorCoreAsync[accum_type, a_type, b_type, wgmma_shape, a_swizzle, b_swizzle, transpose_b], num_k_iters: Int, local_warp_group_idx: UInt, warp_group_thread_idx: UInt)` --- ## context ## `AudioGenerationRequest` {#max.pipelines.core.AudioGenerationRequest} > *class* max.pipelines.core.AudioGenerationRequest(id: 'str', input: 'str', index: 'int', model: 'str', voice: 'str | None' = None, instructions: 'str' = '', response\_format: 'AudioFormat' = 'wav', speed: 'float' = 1.0) **Parameters:** * **id** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **input** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **index** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **model** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **voice** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) * **instructions** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **response\_format** (`AudioFormat` ) * **speed** ([`float`](https://docs.python.org/3/library/functions.html#float) ) ### `id` {#max.pipelines.core.AudioGenerationRequest.id} > id\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking. ### `index` {#max.pipelines.core.AudioGenerationRequest.index} > index\*: [int](https://docs.python.org/3/library/functions.html#int)\* The sequence order of this request within a batch.
This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately. ### `input` {#max.pipelines.core.AudioGenerationRequest.input} > input\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* The text to generate audio for. The maximum length is 4096 characters. ### `instructions` {#max.pipelines.core.AudioGenerationRequest.instructions} > instructions\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* *= ''* Control the voice of your generated audio with additional instructions. Currently unused. ### `model` {#max.pipelines.core.AudioGenerationRequest.model} > model\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* The name of the model to be used for generating audio chunks. This should match the available models on the server and determines the behavior and capabilities of the response generation. ### `response_format` {#max.pipelines.core.AudioGenerationRequest.response_format} > response\_format\*: AudioFormat\* *= 'wav'* The format to return audio in. Currently only supports wav. ### `speed` {#max.pipelines.core.AudioGenerationRequest.speed} > speed\*: [float](https://docs.python.org/3/library/functions.html#float)\* *= 1.0* The speed of the generated audio. Select a value from 0.25 to 4.0. Defaults to 1.0. ### `voice` {#max.pipelines.core.AudioGenerationRequest.voice} > voice\*: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The voice to use for audio generation. ## `AudioGenerator` {#max.pipelines.core.AudioGenerator} > *class* max.pipelines.core.AudioGenerator(\*args, \*\*kwargs) Interface for audio generation models. ### `decode()` {#max.pipelines.core.AudioGenerator.decode} > decode(batch, num\_tokens) Decodes speech tokens to audio bytes. **Parameters:** * **batch** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `AudioGeneratorContext` `]` ) – Batch of audio generation contexts. * **num\_tokens** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of speech tokens to decode. **Returns:** Dictionary mapping request IDs to WAV audio data. **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), DecoderOutput] ### `decoder_sample_rate` {#max.pipelines.core.AudioGenerator.decoder_sample_rate} > *property* decoder\_sample\_rate\*: [int](https://docs.python.org/3/library/functions.html#int)\* The sample rate of the decoder. ### `next_chunk()` {#max.pipelines.core.AudioGenerator.next_chunk} > next\_chunk(batch, num\_tokens) Computes the next audio chunk for a single batch. The new speech tokens are saved to the context. **Parameters:** * **batch** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `AudioGeneratorContext` `]` ) – Batch of contexts. * **num\_tokens** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of speech tokens to generate. **Returns:** Dictionary mapping request IDs to speech token generation status.
**Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [TextGenerationStatus](#max.pipelines.core.TextGenerationStatus)] ### `release()` {#max.pipelines.core.AudioGenerator.release} > release(context) Releases resources associated with this context. **Parameters:** **context** (`AudioGeneratorContext` ) – Finished context. **Return type:** None ## `AudioGeneratorOutput` {#max.pipelines.core.AudioGeneratorOutput} > *class* max.pipelines.core.AudioGeneratorOutput(audio\_data: 'torch.Tensor', metadata: 'dict\[str, Any]') **Parameters:** * **audio\_data** (`torch.Tensor` ) * **metadata** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `Any` `]` ) ### `audio_data` {#max.pipelines.core.AudioGeneratorOutput.audio_data} > audio\_data\*: torch.Tensor\* ### `metadata` {#max.pipelines.core.AudioGeneratorOutput.metadata} > metadata\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), Any]\* ## `EmbeddingsGenerator` {#max.pipelines.core.EmbeddingsGenerator} > *class* max.pipelines.core.EmbeddingsGenerator(\*args, \*\*kwargs) Interface for LLM embeddings-generator models. ### `encode()` {#max.pipelines.core.EmbeddingsGenerator.encode} > encode(batch) Computes embeddings for a batch of inputs. **Parameters:** **batch** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `EmbeddingsGeneratorContext` `]` ) – Batch of contexts to generate embeddings for. **Returns:** Dictionary mapping request IDs to their corresponding embeddings. Each embedding is typically a numpy array or tensor of floating point values. **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), Any] ## `EmbeddingsResponse` {#max.pipelines.core.EmbeddingsResponse} > *class* max.pipelines.core.EmbeddingsResponse(embeddings) Container for the response from the embeddings pipeline. **Parameters:** **embeddings** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) ### `embeddings` {#max.pipelines.core.EmbeddingsResponse.embeddings} > embeddings\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ## `InputContext` {#max.pipelines.core.InputContext} > *class* max.pipelines.core.InputContext(\*args, \*\*kwargs) A base class for model contexts, representing model inputs for TokenGenerators. Token array layout:

```
+-------------- full prompt -------------+   CHUNK_SIZE*N
                                              v
+------------+-------------+--------------+--------------+
| completed  | next_tokens |              | preallocated |
+------------+-------------+--------------+--------------+
   start_idx ^ active_idx ^       end_idx ^
```

* completed: The tokens that have already been processed and encoded. * next\_tokens: The tokens that will be processed in the next iteration. This may be a subset of the full prompt due to chunked prefill. * preallocated: The token slots that have been preallocated. The token array resizes to multiples of CHUNK\_SIZE to accommodate the new tokens.
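The index progression is easiest to see with concrete numbers. The following is a hypothetical walk-through in plain Python with an illustrative 2-token chunk size; the real CHUNK\_SIZE and scheduling are implementation details:

```python
# Sketch of how start_idx/active_idx advance over a 6-token prompt
# when context encoding is chunked 2 tokens at a time.
prompt_len, chunk_size = 6, 2
start_idx = 0
while start_idx < prompt_len:
    active_idx = min(start_idx + chunk_size, prompt_len)
    # completed = tokens[:start_idx], next_tokens = tokens[start_idx:active_idx]
    print(f"encode tokens[{start_idx}:{active_idx}] of {prompt_len}")
    start_idx = active_idx  # the chunk just processed becomes "completed"
# Once the full prompt is encoded, token generation proceeds with
# active_length == 1: one new token per iteration.
```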
### `active_idx` {#max.pipelines.core.InputContext.active_idx} > *property* active\_idx\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `active_length` {#max.pipelines.core.InputContext.active_length} > *property* active\_length\*: [int](https://docs.python.org/3/library/functions.html#int)\* Current sequence length: the number of tokens input this iteration. This will be the prompt size for context encoding, and simply 1 for token generation. ### `assign_to_cache()` {#max.pipelines.core.InputContext.assign_to_cache} > assign\_to\_cache(cache\_seq\_id) Assigns the context to a cache slot. **Parameters:** **cache\_seq\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `bump_token_indices()` {#max.pipelines.core.InputContext.bump_token_indices} > bump\_token\_indices(start\_idx=0, active\_idx=0, end\_idx=0, committed\_idx=0) Update the start\_idx, active\_idx and end\_idx without manipulating the token array. **Parameters:** * **start\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **active\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **end\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **committed\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `cache_seq_id` {#max.pipelines.core.InputContext.cache_seq_id} > *property* cache\_seq\_id\*: [int](https://docs.python.org/3/library/functions.html#int)\* Returns the cache slot assigned to the context, raising an error if not assigned. ### `committed_idx` {#max.pipelines.core.InputContext.committed_idx} > *property* committed\_idx\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `compute_num_available_steps()` {#max.pipelines.core.InputContext.compute_num_available_steps} > compute\_num\_available\_steps(max\_seq\_len) Compute the max number of steps we can execute for a given context without exceeding the max\_seq\_len. **Parameters:** **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** [int](https://docs.python.org/3/library/functions.html#int) ### `current_length` {#max.pipelines.core.InputContext.current_length} > *property* current\_length\*: [int](https://docs.python.org/3/library/functions.html#int)\* The current length of the sequence, including completed and active tokens. ### `end_idx` {#max.pipelines.core.InputContext.end_idx} > *property* end\_idx\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `generated_tokens` {#max.pipelines.core.InputContext.generated_tokens} > *property* generated\_tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* All generated tokens in the context. ### `ignore_eos` {#max.pipelines.core.InputContext.ignore_eos} > *property* ignore\_eos\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* ### `is_assigned_to_cache` {#max.pipelines.core.InputContext.is_assigned_to_cache} > *property* is\_assigned\_to\_cache\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* Returns True if input is assigned to a cache slot, False otherwise. ### `json_schema` {#max.pipelines.core.InputContext.json_schema} > *property* json\_schema\*: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None)\* A json schema to use during constrained decoding.
### `jump_ahead()` {#max.pipelines.core.InputContext.jump_ahead} > jump\_ahead(new\_token, is\_eos=False) Updates the token array, while ensuring the new token is returned to the user. **Parameters:** * **new\_token** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **is\_eos** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** None ### `log_probabilities` {#max.pipelines.core.InputContext.log_probabilities} > *property* log\_probabilities\*: [int](https://docs.python.org/3/library/functions.html#int)\* When > 0, returns the log probabilities for the top N tokens for each token in the sequence. ### `log_probabilities_echo` {#max.pipelines.core.InputContext.log_probabilities_echo} > *property* log\_probabilities\_echo\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* When True, the input tokens are added to the returned logprobs. ### `matcher` {#max.pipelines.core.InputContext.matcher} > *property* matcher\*: xgr.GrammarMatcher | [None](https://docs.python.org/3/library/constants.html#None)\* An optional xgr Grammar Matcher provided when using structured output. ### `max_length` {#max.pipelines.core.InputContext.max_length} > *property* max\_length\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\* The maximum length of this sequence. ### `next_tokens` {#max.pipelines.core.InputContext.next_tokens} > *property* next\_tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* The next prompt tokens to be input during this iteration. This should be a 1D array of tokens of length active\_length. ### `outstanding_completion_tokens()` {#max.pipelines.core.InputContext.outstanding_completion_tokens} > outstanding\_completion\_tokens() Return the list of outstanding completion tokens and log probabilities that must be returned to the user. **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [*LogProbabilities*](#max.pipelines.core.LogProbabilities) | None]] ### `prompt_tokens` {#max.pipelines.core.InputContext.prompt_tokens} > *property* prompt\_tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* Prompt tokens in the context. ### `reset()` {#max.pipelines.core.InputContext.reset} > reset() Resets the context’s state by combining all tokens into a new prompt. This method is used when a request is evicted, meaning that the context needs to be re-encoded in the following CE iteration. **Return type:** None ### `rollback()` {#max.pipelines.core.InputContext.rollback} > rollback(idx) Roll back and remove the last `idx` tokens. **Parameters:** **idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `set_draft_offset()` {#max.pipelines.core.InputContext.set_draft_offset} > set\_draft\_offset(idx) **Parameters:** **idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `set_matcher()` {#max.pipelines.core.InputContext.set_matcher} > set\_matcher(matcher) Set a grammar matcher for use during constrained decoding.
**Parameters:** **matcher** (`xgr.GrammarMatcher` ) **Return type:** None ### `set_token_indices()` {#max.pipelines.core.InputContext.set_token_indices} > set\_token\_indices(start\_idx=None, active\_idx=None, end\_idx=None, committed\_idx=None) Set the token indices without manipulating the token array. **Parameters:** * **start\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **active\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **end\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **committed\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) **Return type:** None ### `start_idx` {#max.pipelines.core.InputContext.start_idx} > *property* start\_idx\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `tokens` {#max.pipelines.core.InputContext.tokens} > *property* tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* All tokens in the context. ### `unassign_from_cache()` {#max.pipelines.core.InputContext.unassign_from_cache} > unassign\_from\_cache() Unassigns the context from a cache slot. **Return type:** None ### `update()` {#max.pipelines.core.InputContext.update} > update(new\_token, log\_probabilities=None, is\_eos=False) Updates the next\_tokens and extends existing tokens to include all generated tokens. **Parameters:** * **new\_token** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **log\_probabilities** ([`LogProbabilities`](#max.pipelines.core.LogProbabilities) `|` `None` ) * **is\_eos** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** None ## `LogProbabilities` {#max.pipelines.core.LogProbabilities} > *class* max.pipelines.core.LogProbabilities(token\_log\_probabilities, top\_log\_probabilities) Log probabilities for an individual output token. **Parameters:** * **token\_log\_probabilities** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`float`](https://docs.python.org/3/library/functions.html#float) `]` ) * **top\_log\_probabilities** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`float`](https://docs.python.org/3/library/functions.html#float) `]` `]` ) ### `token_log_probabilities` {#max.pipelines.core.LogProbabilities.token_log_probabilities} > token\_log\_probabilities Log probabilities of each token. **Type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[float](https://docs.python.org/3/library/functions.html#float)] ### `top_log_probabilities` {#max.pipelines.core.LogProbabilities.top_log_probabilities} > top\_log\_probabilities Top tokens and their corresponding log probabilities. **Type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[int](https://docs.python.org/3/library/functions.html#int), [float](https://docs.python.org/3/library/functions.html#float)]] ## `PipelineAudioTokenizer` {#max.pipelines.core.PipelineAudioTokenizer} > *class* max.pipelines.core.PipelineAudioTokenizer(\*args, \*\*kwargs) Interface for LLM tokenizers. ### `decode()` {#max.pipelines.core.PipelineAudioTokenizer.decode} > *async* decode(context, encoded, \*\*kwargs) Decodes response tokens to text.
**Parameters:** * **context** (`AudioGeneratorContext` ) – Current generation context. * **encoded** (`TokenizerEncoded` ) – Encoded response tokens. * **kwargs** (`Any` ) – Additional keyword arguments. **Returns:** Un-encoded response text. **Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str) ### `encode()` {#max.pipelines.core.PipelineAudioTokenizer.encode} > *async* encode(prompt, add\_special\_tokens) Encodes text prompts as tokens. **Parameters:** * **prompt** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – Un-encoded prompt text. * **add\_special\_tokens** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether to add special tokens to the prompt. **Returns:** Encoded tokens. **Return type:** TokenizerEncoded **Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the prompt exceeds the configured maximum length. ### `eos` {#max.pipelines.core.PipelineAudioTokenizer.eos} > *property* eos\*: [int](https://docs.python.org/3/library/functions.html#int)\* The end of sequence token for this tokenizer. ### `expects_content_wrapping` {#max.pipelines.core.PipelineAudioTokenizer.expects_content_wrapping} > *property* expects\_content\_wrapping\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* If true, this tokenizer expects messages to have a content property. Text messages are formatted as:

```json
{ "type": "text", "content": "text content" }
```

instead of the OpenAI spec:

```json
{ "type": "text", "text": "text content" }
```

NOTE: Multimodal messages omit the content property. Both `image_urls` and `image` content parts are converted to:

```json
{ "type": "image" }
```

Their content is provided as byte arrays through the top-level property on the request object, i.e., [`TokenGeneratorRequest.images`](#max.pipelines.core.TokenGeneratorRequest.images). ### `new_context()` {#max.pipelines.core.PipelineAudioTokenizer.new_context} > *async* new\_context(request) Creates a new context from a request object. This is sent to the worker process once and then cached locally. **Parameters:** **request** ([`AudioGenerationRequest`](#max.pipelines.core.AudioGenerationRequest) ) – Incoming request. **Returns:** Initialized context. **Return type:** AudioGeneratorContext ## `PipelineTask` {#max.pipelines.core.PipelineTask} > *class* max.pipelines.core.PipelineTask(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None) ### `AUDIO_GENERATION` {#max.pipelines.core.PipelineTask.AUDIO_GENERATION} > AUDIO\_GENERATION *= 'audio\_generation'* ### `EMBEDDINGS_GENERATION` {#max.pipelines.core.PipelineTask.EMBEDDINGS_GENERATION} > EMBEDDINGS\_GENERATION *= 'embeddings\_generation'* ### `TEXT_GENERATION` {#max.pipelines.core.PipelineTask.TEXT_GENERATION} > TEXT\_GENERATION *= 'text\_generation'* ## `PipelineTokenizer` {#max.pipelines.core.PipelineTokenizer} > *class* max.pipelines.core.PipelineTokenizer(\*args, \*\*kwargs) Interface for LLM tokenizers. ### `decode()` {#max.pipelines.core.PipelineTokenizer.decode} > *async* decode(context, encoded, \*\*kwargs) Decodes response tokens to text. **Parameters:** * **context** (`TokenGeneratorContext` ) – Current generation context. * **encoded** (`TokenizerEncoded` ) – Encoded response tokens. **Returns:** Un-encoded response text.
**Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str) ### `encode()` {#max.pipelines.core.PipelineTokenizer.encode} > *async* encode(prompt, add\_special\_tokens) Encodes text prompts as tokens. **Parameters:** * **prompt** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – Un-encoded prompt text. * **add\_special\_tokens** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the prompt exceeds the configured maximum length. **Return type:** *TokenizerEncoded* ### `eos` {#max.pipelines.core.PipelineTokenizer.eos} > *property* eos\*: [int](https://docs.python.org/3/library/functions.html#int)\* The end of sequence token for this tokenizer. ### `expects_content_wrapping` {#max.pipelines.core.PipelineTokenizer.expects_content_wrapping} > *property* expects\_content\_wrapping\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* If true, this tokenizer expects messages to have a content property. Text messages are formatted as:

```json
{ "type": "text", "content": "text content" }
```

instead of the OpenAI spec:

```json
{ "type": "text", "text": "text content" }
```

NOTE: Multimodal messages omit the content property. Both `image_urls` and `image` content parts are converted to:

```json
{ "type": "image" }
```

Their content is provided as byte arrays through the top-level property on the request object, i.e., `PipelineTokenizerRequest.images`. ### `new_context()` {#max.pipelines.core.PipelineTokenizer.new_context} > *async* new\_context(request) Creates a new context from a request object. This is sent to the worker process once and then cached locally. **Parameters:** **request** (`PipelineTokenizerRequest` ) – Incoming request. **Returns:** Initialized context. **Return type:** TokenGeneratorContext ## `TTSContext` {#max.pipelines.core.TTSContext} > *class* max.pipelines.core.TTSContext(\*args, \*\*kwargs) A context for the TTS model. ### `next_speech_tokens()` {#max.pipelines.core.TTSContext.next_speech_tokens} > next\_speech\_tokens(audio\_chunk\_size) Returns a chunk of the next unseen speech tokens. Calling this function will update the index of the last seen token. **Parameters:** **audio\_chunk\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of speech tokens to return. **Returns:** A chunk of speech tokens. **Return type:** [*ndarray*](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ### `speech_tokens` {#max.pipelines.core.TTSContext.speech_tokens} > *property* speech\_tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `update_speech_tokens()` {#max.pipelines.core.TTSContext.update_speech_tokens} > update\_speech\_tokens(new\_tokens) Updates the next\_tokens. **Parameters:** **new\_tokens** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) **Return type:** None ## `TextAndVisionContext` {#max.pipelines.core.TextAndVisionContext} > *class* max.pipelines.core.TextAndVisionContext(cache\_seq\_id, prompt, max\_length, tokens, pixel\_values, extra\_model\_args, log\_probabilities=0, log\_probabilities\_echo=False, json\_schema=None, ignore\_eos=False) A base class for model contexts, specifically for Vision model variants.
**Parameters:** * **cache\_seq\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **prompt** (`Union` `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `Sequence` `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` `]` ) * **max\_length** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **tokens** (`np.ndarray` ) * **pixel\_values** (`Sequence` `[` `np.ndarray` `]` ) * **extra\_model\_args** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `Any` `]` ) * **log\_probabilities** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **log\_probabilities\_echo** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **json\_schema** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) * **ignore\_eos** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) ### `update()` {#max.pipelines.core.TextAndVisionContext.update} > update(new\_token, log\_probabilities=None, is\_eos=False) Updates the next\_tokens and extends existing tokens to include all generated tokens. **Parameters:** * **new\_token** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **log\_probabilities** ([`LogProbabilities`](#max.pipelines.core.LogProbabilities) `|` `None` ) * **is\_eos** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** None ## `TextContext` {#max.pipelines.core.TextContext} > *class* max.pipelines.core.TextContext(prompt, max\_length, tokens, cache\_seq\_id=None, log\_probabilities=0, log\_probabilities\_echo=False, json\_schema=None, ignore\_eos=False) A base class for model contexts, specifically for Text model variants. **Parameters:** * **prompt** (`Union` `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `Sequence` `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` `]` ) * **max\_length** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **tokens** (`np.ndarray` ) * **cache\_seq\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **log\_probabilities** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **log\_probabilities\_echo** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **json\_schema** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) * **ignore\_eos** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) ### `active_idx` {#max.pipelines.core.TextContext.active_idx} > *property* active\_idx\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `active_length` {#max.pipelines.core.TextContext.active_length} > *property* active\_length\*: [int](https://docs.python.org/3/library/functions.html#int)\* Current sequence length: the number of tokens input this iteration. This will be the prompt size for context encoding, and simply 1 (or more) for token generation. ### `assign_to_cache()` {#max.pipelines.core.TextContext.assign_to_cache} > assign\_to\_cache(cache\_seq\_id) **Parameters:** **cache\_seq\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `bump_token_indices()` {#max.pipelines.core.TextContext.bump_token_indices} > bump\_token\_indices(start\_idx=0, active\_idx=0, end\_idx=0, committed\_idx=0) Update the start\_idx, active\_idx and end\_idx without manipulating the token array.
**Parameters:** * **start\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **active\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **end\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **committed\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `cache_seq_id` {#max.pipelines.core.TextContext.cache_seq_id} > *property* cache\_seq\_id\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `committed_idx` {#max.pipelines.core.TextContext.committed_idx} > *property* committed\_idx\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `compute_num_available_steps()` {#max.pipelines.core.TextContext.compute_num_available_steps} > compute\_num\_available\_steps(max\_seq\_len) Compute the max number of steps we can execute for a given context without exceeding the max\_seq\_len. **Parameters:** **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** [int](https://docs.python.org/3/library/functions.html#int) ### `current_length` {#max.pipelines.core.TextContext.current_length} > *property* current\_length\*: [int](https://docs.python.org/3/library/functions.html#int)\* The current length of the sequence, including completed and active tokens. ### `end_idx` {#max.pipelines.core.TextContext.end_idx} > *property* end\_idx\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `generated_tokens` {#max.pipelines.core.TextContext.generated_tokens} > *property* generated\_tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `is_assigned_to_cache` {#max.pipelines.core.TextContext.is_assigned_to_cache} > *property* is\_assigned\_to\_cache\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* ### `jump_ahead()` {#max.pipelines.core.TextContext.jump_ahead} > jump\_ahead(new\_token, is\_eos=False) Updates the token array, while ensuring the new token is returned to the user. **Parameters:** * **new\_token** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **is\_eos** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** None ### `next_tokens` {#max.pipelines.core.TextContext.next_tokens} > *property* next\_tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `outstanding_completion_tokens()` {#max.pipelines.core.TextContext.outstanding_completion_tokens} > outstanding\_completion\_tokens() Return the list of outstanding completion tokens and log probabilities that must be returned to the user. **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [*LogProbabilities*](#max.pipelines.core.LogProbabilities) | None]] ### `prompt_tokens` {#max.pipelines.core.TextContext.prompt_tokens} > *property* prompt\_tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `reset()` {#max.pipelines.core.TextContext.reset} > reset() Resets the context’s state by combining all tokens into a new prompt. 
**Return type:** None ### `rollback()` {#max.pipelines.core.TextContext.rollback} > rollback(idx) **Parameters:** **idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `set_draft_offset()` {#max.pipelines.core.TextContext.set_draft_offset} > set\_draft\_offset(idx) **Parameters:** **idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `set_matcher()` {#max.pipelines.core.TextContext.set_matcher} > set\_matcher(matcher) **Parameters:** **matcher** (`xgr.GrammarMatcher` ) **Return type:** None ### `set_token_indices()` {#max.pipelines.core.TextContext.set_token_indices} > set\_token\_indices(start\_idx=None, active\_idx=None, end\_idx=None, committed\_idx=None) Set the token indices without manipulating the token array. **Parameters:** * **start\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **active\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **end\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **committed\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) **Return type:** None ### `start_idx` {#max.pipelines.core.TextContext.start_idx} > *property* start\_idx\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `tokens` {#max.pipelines.core.TextContext.tokens} > *property* tokens\*: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `unassign_from_cache()` {#max.pipelines.core.TextContext.unassign_from_cache} > unassign\_from\_cache() **Return type:** None ### `update()` {#max.pipelines.core.TextContext.update} > update(new\_token, log\_probabilities=None, is\_eos=False) Updates the next\_tokens and extends existing tokens to include all generated tokens. 
**Parameters:** * **new\_token** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **log\_probabilities** ([`LogProbabilities`](#max.pipelines.core.LogProbabilities) `|` `None` ) * **is\_eos** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** None ## `TextGenerationResponse` {#max.pipelines.core.TextGenerationResponse} > *class* max.pipelines.core.TextGenerationResponse(tokens, final\_status) **Parameters:** * **tokens** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`TextResponse`](#max.pipelines.core.TextResponse) `]` ) * **final\_status** ([`TextGenerationStatus`](#max.pipelines.core.TextGenerationStatus) ) ### `append_token()` {#max.pipelines.core.TextGenerationResponse.append_token} > append\_token(token) **Parameters:** **token** ([`TextResponse`](#max.pipelines.core.TextResponse) ) **Return type:** None ### `final_status` {#max.pipelines.core.TextGenerationResponse.final_status} > *property* final\_status\*: [TextGenerationStatus](#max.pipelines.core.TextGenerationStatus)\* ### `is_done` {#max.pipelines.core.TextGenerationResponse.is_done} > *property* is\_done\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* ### `tokens` {#max.pipelines.core.TextGenerationResponse.tokens} > *property* tokens\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[TextResponse](#max.pipelines.core.TextResponse)]\* ### `update_status()` {#max.pipelines.core.TextGenerationResponse.update_status} > update\_status(status) **Parameters:** **status** ([`TextGenerationStatus`](#max.pipelines.core.TextGenerationStatus) ) **Return type:** None ## `TextGenerationStatus` {#max.pipelines.core.TextGenerationStatus} > *class* max.pipelines.core.TextGenerationStatus(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None) ### `ACTIVE` {#max.pipelines.core.TextGenerationStatus.ACTIVE} > ACTIVE *= 'active'* ### `END_OF_SEQUENCE` {#max.pipelines.core.TextGenerationStatus.END_OF_SEQUENCE} > END\_OF\_SEQUENCE *= 'end\_of\_sequence'* ### `MAXIMUM_LENGTH` {#max.pipelines.core.TextGenerationStatus.MAXIMUM_LENGTH} > MAXIMUM\_LENGTH *= 'maximum\_length'* ### `is_done` {#max.pipelines.core.TextGenerationStatus.is_done} > *property* is\_done\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* ## `TextResponse` {#max.pipelines.core.TextResponse} > *class* max.pipelines.core.TextResponse(next\_token, log\_probabilities=None) A base class for model responses, specifically for Text model variants. **Parameters:** * **next\_token** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **log\_probabilities** ([`LogProbabilities`](#max.pipelines.core.LogProbabilities) `|` `None` ) ### `next_token` {#max.pipelines.core.TextResponse.next_token} > next\_token Encoded predicted next token. **Type:** [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) ### `log_probabilities` {#max.pipelines.core.TextResponse.log_probabilities} > log\_probabilities Log probabilities of each output token. **Type:** [LogProbabilities](#max.pipelines.core.LogProbabilities) | None ## `TokenGenerator` {#max.pipelines.core.TokenGenerator} > *class* max.pipelines.core.TokenGenerator(\*args, \*\*kwargs) Interface for LLM token-generator models.
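To make the interface concrete, here is a hypothetical sketch of driving an implementation; the `pipeline` and `context` objects are assumed to exist, and the methods follow the signatures documented below:

```python
# `pipeline` implements TokenGenerator; `context` is a TokenGeneratorContext.
batch = {"request-0": context}    # request ID -> context
steps = pipeline.next_token(batch, num_steps=8)
for step in steps:                # one dict of responses per generated step
    response = step.get("request-0")
    if response is not None:
        print(response.next_token)
pipeline.release(context)         # free resources for this context when done
```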
### `next_token()` {#max.pipelines.core.TokenGenerator.next_token} > next\_token(batch, num\_steps) Computes the next token response for a single batch. **Parameters:** * **batch** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `TokenGeneratorContext` `]` ) – Batch of contexts. * **num\_steps** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of tokens to generate. **Returns:** List of encoded responses (indexed by request ID) **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [TextResponse](#max.pipelines.core.TextResponse)]] ### `release()` {#max.pipelines.core.TokenGenerator.release} > release(context) Releases resources associated with this context. **Parameters:** **context** (`TokenGeneratorContext` ) – Finished context. **Return type:** None ## `TokenGeneratorRequest` {#max.pipelines.core.TokenGeneratorRequest} > *class* max.pipelines.core.TokenGeneratorRequest(id: [str](https://docs.python.org/3/library/stdtypes.html#str), index: [int](https://docs.python.org/3/library/functions.html#int), model\_name: [str](https://docs.python.org/3/library/stdtypes.html#str), prompt: [str](https://docs.python.org/3/library/stdtypes.html#str) | [collections.abc.Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[int](https://docs.python.org/3/library/functions.html#int)] | NoneType = None, messages: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[max.pipelines.core.interfaces.text\_generation.TokenGeneratorRequestMessage](#max.pipelines.core.TokenGeneratorRequestMessage)] | [None](https://docs.python.org/3/library/constants.html#None) = None, images: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[bytes](https://docs.python.org/3/library/stdtypes.html#bytes)] | [None](https://docs.python.org/3/library/constants.html#None) = None, tools: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[max.pipelines.core.interfaces.text\_generation.TokenGeneratorRequestTool](#max.pipelines.core.TokenGeneratorRequestTool)] | [None](https://docs.python.org/3/library/constants.html#None) = None, response\_format: [max.pipelines.core.interfaces.text\_generation.TokenGeneratorResponseFormat](#max.pipelines.core.TokenGeneratorResponseFormat) | [None](https://docs.python.org/3/library/constants.html#None) = None, max\_new\_tokens: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None, timestamp\_ns: [int](https://docs.python.org/3/library/functions.html#int) = 0, request\_path: [str](https://docs.python.org/3/library/stdtypes.html#str) = '/', logprobs: [int](https://docs.python.org/3/library/functions.html#int) = 0, echo: [bool](https://docs.python.org/3/library/functions.html#bool) = False, stop: [str](https://docs.python.org/3/library/stdtypes.html#str) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)] | NoneType = None, ignore\_eos: [bool](https://docs.python.org/3/library/functions.html#bool) = False, chat\_template\_options: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), Any] | [None](https://docs.python.org/3/library/constants.html#None) = None) **Parameters:** * **id** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **index** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **model\_name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **prompt** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` `|` `None` ) * **messages** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`TokenGeneratorRequestMessage`](#max.pipelines.core.TokenGeneratorRequestMessage) `]` `|` `None` ) * **images** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`bytes`](https://docs.python.org/3/library/stdtypes.html#bytes) `]` `|` `None` ) * **tools** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`TokenGeneratorRequestTool`](#max.pipelines.core.TokenGeneratorRequestTool) `]` `|` `None` ) * **response\_format** ([`TokenGeneratorResponseFormat`](#max.pipelines.core.TokenGeneratorResponseFormat) `|` `None` ) * **max\_new\_tokens** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **timestamp\_ns** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **request\_path** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **logprobs** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **echo** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **stop** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `]` `|` `None` ) * **ignore\_eos** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **chat\_template\_options** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `]` `|` `None` ) ### `chat_template_options` {#max.pipelines.core.TokenGeneratorRequest.chat_template_options} > chat\_template\_options\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* Optional dictionary of options to pass when applying the chat template. ### `echo` {#max.pipelines.core.TokenGeneratorRequest.echo} > echo\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* If set to True, the response will include the original prompt along with the generated output. This can be useful for debugging or when you want to see how the input relates to the output. ### `id` {#max.pipelines.core.TokenGeneratorRequest.id} > id\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* A unique identifier for the request. This ID can be used to trace and log the request throughout its lifecycle, facilitating debugging and tracking. ### `ignore_eos` {#max.pipelines.core.TokenGeneratorRequest.ignore_eos} > ignore\_eos\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* If set to True, the response will ignore the EOS token and continue to generate until it reaches `max_new_tokens` or hits a stop string.
### `images` {#max.pipelines.core.TokenGeneratorRequest.images} > images\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[bytes](https://docs.python.org/3/library/stdtypes.html#bytes)] | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* A list of image byte arrays that can be included as part of the request. This field is optional and may be used for multimodal inputs where images are relevant to the prompt or task. ### `index` {#max.pipelines.core.TokenGeneratorRequest.index} > index\*: [int](https://docs.python.org/3/library/functions.html#int)\* The sequence order of this request within a batch. This is useful for maintaining the order of requests when processing multiple requests simultaneously, ensuring that responses can be matched back to their corresponding requests accurately. ### `logprobs` {#max.pipelines.core.TokenGeneratorRequest.logprobs} > logprobs\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 0* The number of top log probabilities to return for each generated token. A value of 0 means that log probabilities will not be returned. Useful for analyzing model confidence in its predictions. ### `max_new_tokens` {#max.pipelines.core.TokenGeneratorRequest.max_new_tokens} > max\_new\_tokens\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The maximum number of new tokens to generate in the response. If not set, the model may generate tokens until it reaches its internal limits or based on other stopping criteria. ### `messages` {#max.pipelines.core.TokenGeneratorRequest.messages} > messages\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[TokenGeneratorRequestMessage](#max.pipelines.core.TokenGeneratorRequestMessage)] | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* A list of messages for chat-based interactions. This is used in chat completion APIs, where each message represents a turn in the conversation. If provided, the model will generate responses based on these messages. ### `model_name` {#max.pipelines.core.TokenGeneratorRequest.model_name} > model\_name\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* The name of the model to be used for generating tokens. This should match the available models on the server and determines the behavior and capabilities of the response generation. ### `prompt` {#max.pipelines.core.TokenGeneratorRequest.prompt} > prompt\*: [str](https://docs.python.org/3/library/stdtypes.html#str) | [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[int](https://docs.python.org/3/library/functions.html#int)] | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The prompt to be processed by the model. This field supports legacy completion APIs and can accept either a string or a sequence of integers representing token IDs. If not provided, the model may generate output based on the messages field. ### `request_path` {#max.pipelines.core.TokenGeneratorRequest.request_path} > request\_path\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* *= '/'* The endpoint path for the request. This is typically used for routing and logging requests within the server infrastructure. 
### `response_format` {#max.pipelines.core.TokenGeneratorRequest.response_format} > response\_format\*: [TokenGeneratorResponseFormat](#max.pipelines.core.TokenGeneratorResponseFormat) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* Specifies the desired format for the model’s output. When set, it enables structured generation, which adheres to the json\_schema provided. ### `stop` {#max.pipelines.core.TokenGeneratorRequest.stop} > stop\*: [str](https://docs.python.org/3/library/stdtypes.html#str) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)] | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* Optional list of stop expressions (see https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop). ### `timestamp_ns` {#max.pipelines.core.TokenGeneratorRequest.timestamp_ns} > timestamp\_ns\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 0* The time (in nanoseconds) when the request was received by the server. This can be useful for performance monitoring and logging purposes. ### `tools` {#max.pipelines.core.TokenGeneratorRequest.tools} > tools\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[TokenGeneratorRequestTool](#max.pipelines.core.TokenGeneratorRequestTool)] | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* A list of tools that can be invoked during the generation process. This allows the model to utilize external functionalities or APIs to enhance its responses. ## `TokenGeneratorRequestFunction` {#max.pipelines.core.TokenGeneratorRequestFunction} > *class* max.pipelines.core.TokenGeneratorRequestFunction ### `description` {#max.pipelines.core.TokenGeneratorRequestFunction.description} > description\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* ### `name` {#max.pipelines.core.TokenGeneratorRequestFunction.name} > name\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* ### `parameters` {#max.pipelines.core.TokenGeneratorRequestFunction.parameters} > parameters\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\* ## `TokenGeneratorRequestMessage` {#max.pipelines.core.TokenGeneratorRequestMessage} > *class* max.pipelines.core.TokenGeneratorRequestMessage ### `content` {#max.pipelines.core.TokenGeneratorRequestMessage.content} > content\*: [str](https://docs.python.org/3/library/stdtypes.html#str) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Any](https://docs.python.org/3/library/typing.html#typing.Any)]]\* Content can be a simple string or a list of message parts of different modalities. For example:

```json
{
  "role": "user",
  "content": "What's the weather like in Boston today?"
}
```

Or:

```json
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "What's in this image?"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
      }
    }
  ]
}
```

### `role` {#max.pipelines.core.TokenGeneratorRequestMessage.role} > role\*: [Literal](https://docs.python.org/3/library/typing.html#typing.Literal)\['system', 'user', 'assistant']\* ## `TokenGeneratorRequestTool` {#max.pipelines.core.TokenGeneratorRequestTool} > *class* max.pipelines.core.TokenGeneratorRequestTool ### `function` {#max.pipelines.core.TokenGeneratorRequestTool.function} > function\*: [TokenGeneratorRequestFunction](#max.pipelines.core.TokenGeneratorRequestFunction)\* ### `type` {#max.pipelines.core.TokenGeneratorRequestTool.type} > type\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* ## `TokenGeneratorResponseFormat` {#max.pipelines.core.TokenGeneratorResponseFormat} > *class* max.pipelines.core.TokenGeneratorResponseFormat ### `json_schema` {#max.pipelines.core.TokenGeneratorResponseFormat.json_schema} > json\_schema\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\* ### `type` {#max.pipelines.core.TokenGeneratorResponseFormat.type} > type\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* --- ## Context ```c #include "max/c/context.h" ``` ## Functions ### `M_newRuntimeConfig()` > [M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*M\_newRuntimeConfig() Creates a new runtime config. This configures runtime details such as the number of threads and log level. By default, the config object’s number of threads will be set to `0`, which is internally used to refer to the number of physical processors in the first socket in the system. You can change this with [`M_setNumThreads()`](#context_8h_1a8734265a43df2dd1354c9f7237734aa2). You need this as an argument for [`M_newRuntimeContext()`](#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175). * **Returns:** A pointer to the new runtime config. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeRuntimeConfig()`](#context_8h_1a47f7e22f7f71da9ab5fb3a1886911610). ### `M_setNumThreads()` > void M\_setNumThreads([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config, size\_t numThreads) Sets the number of threads in a runtime’s threadpool. * **Parameters:** * **config** – The runtime config. * **numThreads** – The number of threads. ### `M_setAllocatorType()` > void M\_setAllocatorType([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config, [M\_AllocatorType](types.md#_CPPv415M_AllocatorType) allocatorType) Sets the memory allocator used for tensor allocations. * **Parameters:** * **config** – The runtime config. * **allocatorType** – An identifier for the type of allocator to use. Currently must be kCaching or kSystem. kCaching uses an allocator that trades off memory usage for performance by not freeing memory immediately to the system. kSystem, on the other hand, frees memory immediately to the system and may not be as performant in some cases. The default is kCaching. ### `M_setCPUAffinity()` > void M\_setCPUAffinity([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config, bool cpuAffinity) Sets whether CPU affinity is enabled. * **Parameters:** * **config** – The runtime config. * **cpuAffinity** – The new CPU affinity setting.
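As a quick usage sketch of the config functions above: the thread count and affinity values here are illustrative assumptions, and `M_getNumThreads()` and `M_freeRuntimeConfig()` are documented below.

```c
#include <stdbool.h>
#include <stdio.h>

#include "max/c/context.h"

int main(void) {
  // By default the config uses 0 threads, meaning the number of physical
  // processors in the first socket.
  M_RuntimeConfig *config = M_newRuntimeConfig();

  // Illustrative settings; tune these for your own workload.
  M_setNumThreads(config, 8);
  M_setCPUAffinity(config, true);

  printf("threads: %zu\n", M_getNumThreads(config));

  M_freeRuntimeConfig(config);
  return 0;
}
```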
### `M_getNumThreads()` > size\_t M\_getNumThreads([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config) Gets the number of threads in a runtime’s threadpool. * **Parameters:** **config** – The runtime config. * **Returns:** The number of threads in the runtime’s threadpool. Otherwise, `0` if [`M_setNumThreads()`](#context_8h_1a8734265a43df2dd1354c9f7237734aa2) has not been called. ### `M_getCPUAffinity()` > bool M\_getCPUAffinity([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config) Gets the current CPU affinity setting. Note that this does not guarantee that any CPU affinity will be set; however, if this is false, then it is guaranteed that *no* CPU affinity will be set. * **Parameters:** **config** – The runtime config. * **Returns:** The current CPU affinity setting. ### `M_enableCrashLog()` > void M\_enableCrashLog([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config, const char \*crashDir) Enables crash logging and sets the location where crash dumps are stored. Note that this will install signal handlers to do so; ensure that this method is called last to unwind to previously registered handlers. * **Parameters:** * **config** – The runtime config. * **crashDir** – The crash dump directory. ### `M_freeRuntimeConfig()` > void M\_freeRuntimeConfig([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config) Deallocates the memory for a runtime config. No-op if `config` is `NULL`. * **Parameters:** **config** – The runtime config. ### `M_newRuntimeContext()` > [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*M\_newRuntimeContext(const [M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config, [M\_Status](types.md#_CPPv48M_Status) \*status) Creates a runtime context. The context is an application-level object that sets up various resources such as threadpool and allocators during inference. You need this before you can call [`M_compileModel()`](model.md#model_8h_1a88afca26a64b945885e1e1a0d09b5750). It’s expected that there’s only one runtime context active in an inference session at a time. We recommend you create one context and use it throughout your application. For example:

```c
M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
```

* **Parameters:** * **config** – The runtime config, from [`M_newRuntimeConfig()`](#context_8h_1a963f1d4eefd812ba8691acf516007cfc). * **status** – The status object for reporting errors. It is filled with an error message if construction of the runtime context fails. * **Returns:** A pointer to the runtime context object. On success, this is a valid pointer. On failure, this is a `NULL` pointer with an error message in the status. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeRuntimeContext()`](#context_8h_1a2434a11d8d65890c66f6b5516243a730). ### `M_freeRuntimeContext()` > void M\_freeRuntimeContext([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Deallocates the memory for a runtime context. No-op if `context` is `NULL`. * **Parameters:** **context** – The runtime context. ### `M_setDebugPrintOptions()` > void M\_setDebugPrintOptions([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, [M\_ResultOutputStyle](types.md#_CPPv419M_ResultOutputStyle) style, unsigned int precision, const char \*directory) Sets the options for debug printing of tensors when executing a model. * **Parameters:** * **context** – The runtime context. * **style** – The way the data will be printed. * **precision** – The floating-point print precision. * **directory** – The directory to store binary output. ### `M_setMojoDefineBool()` > void M\_setMojoDefineBool([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, const char \*key, bool value) Sets a Mojo compile-time define with a boolean value. * **Parameters:** * **context** – The runtime context. * **key** – The name of the define. * **value** – The boolean to set the define to. ### `M_setMojoDefineInt()` > void M\_setMojoDefineInt([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, const char \*key, int value) Sets a Mojo compile-time define with an integer value. * **Parameters:** * **context** – The runtime context. * **key** – The name of the define. * **value** – The integer to set the define to. ### `M_setMojoDefineString()` > void M\_setMojoDefineString([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, const char \*key, const char \*value) Sets a Mojo compile-time define with a string value. * **Parameters:** * **context** – The runtime context. * **key** – The name of the define. * **value** – The string to set the define to. --- ## Context encoding Context encoding (also known as "prefill") is the first phase in a [transformer model](transformer.mdx) that converts input data into a cached numerical representation ([KV cache](kv-cache.mdx)) and predicts the first token. It occurs after the input has already been [tokenized](tokenization.mdx) (preprocessed). Context encoding is then followed by the [autoregressive](autoregression.mdx) token generation phase, which produces one token at a time. If it weren't for the KV cache built during context encoding, the model would have to recalculate the [self-attention](self-attention.mdx) score for each token in the original input, every time it starts to predict a new token. Context encoding is usually the most computationally expensive phase in an LLM, because it must calculate attention scores for every token in the input sequence. Although this process may be parallelized across thousands of GPU threads (because each token can be processed separately), it is still a significant latency factor for time-to-first-token (TTFT). The model can usually produce subsequent tokens much faster than the first one because each round of token generation needs to calculate an attention score for only one token (the new one). --- ## Continuous batching Continuous batching is a [batching](batching.mdx) technique that can continuously dispatch inference requests to the GPU for [token generation](token-generation.mdx) and dramatically improve GPU utilization. Continuous batching can start executing a new batch even before the previous batch finishes its pass through the model, because this batching technique schedules new processing at the "token level." That is, because large language models (LLMs) generate responses one token at a time, there is a repeated cycle during inference (the token generation phase) in which a new batch can jump in to utilize the GPU, even before a previous batch finishes its pass through the model.
That's what it means to operate at the "token level"—the batch scheduler focuses on keeping the GPU busy with token generation at all times, instead of waiting for the previous batch to finish its complete forward pass. This is sometimes called "in-flight batching" in cases where context encoding and token generation requests are combined into the same batch. --- ## continuous_batching_cache Continuous Batching enabled KV cache for the Transformer leveraging the mo.opaque pattern. ## `ContinuousBatchingKVCache` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCache} > *class* max.nn.kv\_cache.continuous\_batching\_cache.ContinuousBatchingKVCache(value) Continuous Mojo KV cache graph value. Value is abstract, it shouldn’t be constructed directly. **Parameters:** **value** ([`Value`](../../graph/Value.md#max.graph.Value) `|` `\_Value` `[` `mo.OpaqueType` `]` ) ## `ContinuousBatchingKVCacheCollection` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection} > *class* max.nn.kv\_cache.continuous\_batching\_cache.ContinuousBatchingKVCacheCollection(value) The graph value for a view of the KV cache. Value is abstract, it shouldn’t be constructed directly. **Parameters:** **value** ([`Value`](../../graph/Value.md#max.graph.Value) `|` `\_Value` `[` `mo.OpaqueType` `]` ) ## `ContinuousBatchingKVCacheCollectionType` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollectionType} > *class* max.nn.kv\_cache.continuous\_batching\_cache.ContinuousBatchingKVCacheCollectionType The graph type for a “view” of the cache for the given sequences in the batch. This object does not own the underlying buffers in k\_cache and v\_cache, it’s borrowing them from the BlockWrappers in our ContinuousKVCacheManager. It does own the Pointer\[NDBuffer\[type, 3]] and valid\_lengths buffer Creates an opaque type containing a continuous batching KV cache collection. 
## `ContinuousBatchingKVCacheInputSymbols` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheInputSymbols} > *class* max.nn.kv\_cache.continuous\_batching\_cache.ContinuousBatchingKVCacheInputSymbols(kv\_blocks: 'TensorType', cache\_lengths: 'TensorType', lookup\_table: 'TensorType', max\_lengths: 'TensorType') **Parameters:** * **kv\_blocks** ([`TensorType`](../../graph/type.md#max.graph.type.TensorType) ) * **cache\_lengths** ([`TensorType`](../../graph/type.md#max.graph.type.TensorType) ) * **lookup\_table** ([`TensorType`](../../graph/type.md#max.graph.type.TensorType) ) * **max\_lengths** ([`TensorType`](../../graph/type.md#max.graph.type.TensorType) ) ### `cache_lengths` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheInputSymbols.cache_lengths} > cache\_lengths\*: [TensorType](../../graph/type.md#max.graph.type.TensorType)\* ### `kv_blocks` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheInputSymbols.kv_blocks} > kv\_blocks\*: [TensorType](../../graph/type.md#max.graph.type.TensorType)\* ### `lookup_table` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheInputSymbols.lookup_table} > lookup\_table\*: [TensorType](../../graph/type.md#max.graph.type.TensorType)\* ### `max_lengths` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheInputSymbols.max_lengths} > max\_lengths\*: [TensorType](../../graph/type.md#max.graph.type.TensorType)\* ## `ContinuousBatchingKVCacheManager` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheManager} > *class* max.nn.kv\_cache.continuous\_batching\_cache.ContinuousBatchingKVCacheManager(params, max\_batch\_size, max\_seq\_len, num\_layers, devices, session) **Parameters:** * **params** ([`KVCacheParams`](cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) * **max\_batch\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **num\_layers** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **devices** (`Sequence` `[` [`Device`](../../driver.md#max.driver.Device) `]` ) * **session** ([`InferenceSession`](../../engine.md#max.engine.InferenceSession) ) ### `block_shape()` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheManager.block_shape} > block\_shape(n\_sequences) Returns the shape of the KV cache blocks for the given number of sequences. Defines the 6-dimensional shape of the cache blocks used to store key and value tensors for transformer attention. The dimensions represent: \[n\_sequences, 2, num\_layers, max\_seq\_len, n\_kv\_heads\_per\_device, head\_dim] where 2 represents separate storage for keys and values. **Parameters:** **n\_sequences** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of sequences that will be cached. **Returns:** List describing the shape of the cache blocks, with dimensions for sequences, key/value split, layers, sequence length, attention heads, and head dimension. **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list) ### `estimated_memory_size()` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheManager.estimated_memory_size} > *classmethod* estimated\_memory\_size(params, max\_batch\_size, max\_seq\_len, num\_layers, available\_cache\_memory, devices, \*\*kwargs) Returns the estimated total memory usage of the KV cache. **Parameters:** * **params** ([`KVCacheParams`](cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) * **max\_batch\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **num\_layers** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **available\_cache\_memory** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **devices** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`Device`](../../driver.md#max.driver.Device) `]` ) * **kwargs** ([`Any`](https://docs.python.org/3/library/typing.html#typing.Any) ) **Return type:** [int](https://docs.python.org/3/library/functions.html#int) ### `fetch()` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheManager.fetch} > fetch(batch, num\_steps=1) Fetches the KV cache state for the given sequence IDs. This method retrieves the current cache state for a batch of sequences, including their cache lengths and lookup information. It’s used during token generation to access previously cached key/value pairs. **Parameters:** * **batch** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `T` `]` ) – List of KVCacheAwareContext objects for which to fetch cache state. * **num\_steps** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of steps to run for multi-step scheduling. **Returns:** List of tuples, one per device, each containing: * blocks: Tensor containing the KV cache blocks * cache\_lengths: Tensor of current cache lengths for each sequence * lookup\_table: Tensor mapping sequence IDs to cache positions * max\_lengths: Tensor containing \[max\_seq\_length, max\_cache\_length] **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list) **Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If any seq\_id exceeds max\_batch\_size or doesn’t exist in cache ### `infer_optimal_batch_size()` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheManager.infer_optimal_batch_size} > *classmethod* infer\_optimal\_batch\_size(params, max\_seq\_len, num\_layers, available\_cache\_memory, devices, \*\*kwargs) Returns the estimated optimal batch size for the KV cache. **Parameters:** * **params** ([`KVCacheParams`](cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) * **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **num\_layers** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **available\_cache\_memory** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **devices** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`Device`](../../driver.md#max.driver.Device) `]` ) * **kwargs** ([`Any`](https://docs.python.org/3/library/typing.html#typing.Any) ) **Return type:** [int](https://docs.python.org/3/library/functions.html#int) ### `input_symbols()` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheManager.input_symbols} > input\_symbols() Returns the expected input tensor types for fetch on each device. Defines the tensor specifications needed by the cache implementation, including shapes and data types. This is used for graph construction and validation.
**Returns:** List of ContinuousBatchingKVCacheInputSymbols for each device containing TensorTypes for: * KV cache blocks: 6D tensor for storing keys and values * Cache lengths: 1D tensor tracking sequence lengths * Lookup table: 1D tensor mapping sequence IDs to cache positions * Maximum lengths: 2D tensor tracking maximum sequence and cache lengths per step. **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*ContinuousBatchingKVCacheInputSymbols*](#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheInputSymbols)] ## `ContinuousBatchingKVCacheType` {#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheType} > *class* max.nn.kv\_cache.continuous\_batching\_cache.ContinuousBatchingKVCacheType Continuous Mojo KV cache graph type. Creates an opaque type containing a continuous batching KV cache. ## `FetchContinuousBatchingKVCacheCollection` {#max.nn.kv_cache.continuous_batching_cache.FetchContinuousBatchingKVCacheCollection} > *class* max.nn.kv\_cache.continuous\_batching\_cache.FetchContinuousBatchingKVCacheCollection(kv\_params, \*\*kwargs) **Parameters:** * **kv\_params** ([`KVCacheParams`](cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) * **kwargs** (`Any` ) --- ## ContinuousBatchingKVCache `@register_passable(trivial)` `struct ContinuousBatchingKVCache[type_: DType, kv_params_: KVCacheStaticParams, assert_write_mode: Int = 0]` Wrapper for the ContinuousKVCache of a given layer in the transformer model. This abstracts the Pointer indirection for accessing the ContinuousKVCache for a given batch entry. This is the type that is passed to KV projection and flash attention kernels. ## Fields * blocks (`NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * lookup\_table (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * max\_seq\_length (`SIMD[uint32, 1]`): * max\_cache\_length (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `KVCacheT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `kv_params` `alias kv_params = kv_params_` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(blocks: NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1]) -> Self` ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. ### `cache_lengths_nd` `cache_lengths_nd(self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `load` `load[width: Int](self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[type_, width]` ### `store` `store(self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[type_, size])` ### `empty_cache` `empty_cache(self) -> Bool` Returns true if the cache\_lengths for all requests is 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[type_, 1]]` --- ## ContinuousBatchingKVCacheCollection `struct ContinuousBatchingKVCacheCollection[type_: DType, kv_params_: KVCacheStaticParams, assert_write_mode: Int = 0]` This is a "view" of the cache for the given sequences in the batch. This object does not own the underlying buffers in k\_cache and v\_cache; it borrows them from the BlockWrappers in our KVCacheManager. It does own the Pointer\[NDBuffer\[type, 3]] and the valid\_lengths buffer. ## Fields * cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * lookup\_table (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * blocks (`NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * max\_seq\_length (`SIMD[uint32, 1]`): * max\_cache\_length (`SIMD[uint32, 1]`): * kv\_cache\_dynamic\_shape (`IndexList[4]`): * kv\_cache\_dynamic\_strides (`IndexList[4]`): ## Implemented traits `AnyType`, `Copyable`, `KVCollectionT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(-31337), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `CacheType` `alias CacheType = ContinuousBatchingKVCache[type_, kv_params_, assert_write_mode]` ### `kv_params` `alias kv_params = kv_params_` ### `name_str` `alias name_str = "continuous_batching"` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(out self, blocks: NDBuffer[type_, 6, MutableAnyOrigin], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1])` ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value.
### `get_key_cache` `get_key_cache(self, layer_idx: Int) -> ContinuousBatchingKVCache[type_, kv_params_, assert_write_mode]` ### `get_value_cache` `get_value_cache(self, layer_idx: Int) -> ContinuousBatchingKVCache[type_, kv_params_, assert_write_mode]` ### `cache_length` `cache_length(self, bs_idx: Int) -> Int` --- ## Control flow Mojo includes several traditional control flow structures for conditional and repeated execution of code blocks. ## The `if` statement Mojo supports the `if` statement for conditional code execution. With it you can conditionally execute an indented code block if a given [boolean](/mojo/manual/types#booleans) expression evaluates to `True`.

```mojo
temp_celsius = 25
if temp_celsius > 20:
    print("It is warm.")
    print("The temperature is", temp_celsius * 9 / 5 + 32, "Fahrenheit.")
```

```output
It is warm.
The temperature is 77.0 Fahrenheit.
```

You can write the entire `if` statement as a single line if all you need to execute conditionally is a single, short statement.

```mojo
temp_celsius = 22
if temp_celsius > 20: print("It is warm.")
```

```output
It is warm.
```

Optionally, an `if` statement can include any number of additional `elif` clauses, each specifying a boolean condition and associated code block to execute if `True`. The conditions are tested in the order given. When a condition evaluates to `True`, the associated code block is executed and no further conditions are tested. Additionally, an `if` statement can include an optional `else` clause providing a code block to execute if all conditions evaluate to `False`.

```mojo
temp_celsius = 25
if temp_celsius <= 0:
    print("It is freezing.")
elif temp_celsius < 20:
    print("It is cool.")
elif temp_celsius < 30:
    print("It is warm.")
else:
    print("It is hot.")
```

```output
It is warm.
```

### Short-circuit evaluation

Mojo follows [short-circuit evaluation](https://en.wikipedia.org/wiki/Short-circuit_evaluation) semantics for boolean operators. If the first argument to an `or` operator evaluates to `True`, the second argument is not evaluated.

```mojo
def true_func() -> Bool:
    print("Executing true_func")
    return True

def false_func() -> Bool:
    print("Executing false_func")
    return False

print('Short-circuit "or" evaluation')
if true_func() or false_func():
    print("True result")
```

```output
Short-circuit "or" evaluation
Executing true_func
True result
```

If the first argument to an `and` operator evaluates to `False`, the second argument is not evaluated.

```mojo
print('Short-circuit "and" evaluation')
if false_func() and true_func():
    print("True result")
```

```output
Short-circuit "and" evaluation
Executing false_func
```

### Conditional expressions

Mojo also supports conditional expressions (or what is sometimes called a [*ternary conditional operator*](https://en.wikipedia.org/wiki/Ternary_conditional_operator)) using the syntax `true_result if boolean_expression else false_result`, just as in Python. This is most often used as a concise way to assign one of two different values to a variable, based on a boolean condition.

```mojo
temp_celsius = 15
forecast = "warm" if temp_celsius > 20 else "cool"
print("The forecast for today is", forecast)
```

```output
The forecast for today is cool
```

The alternative, written as a multi-line `if` statement, is more verbose.

```mojo
if temp_celsius > 20:
    forecast = "warm"
else:
    forecast = "cool"
print("The forecast for today is", forecast)
```

```output
The forecast for today is cool
```

## The `while` statement

The `while` loop repeatedly executes a code block while a given boolean expression evaluates to `True`. For example, the following loop prints values from the Fibonacci series that are less than 50.

```mojo
fib_prev = 0
fib_curr = 1
print(fib_prev, end="")
while fib_curr < 50:
    print(",", fib_curr, end="")
    fib_prev, fib_curr = fib_curr, fib_prev + fib_curr
```

```output
0, 1, 1, 2, 3, 5, 8, 13, 21, 34
```

A `continue` statement skips execution of the rest of the code block and resumes with the loop test expression.
```mojo n = 0 while n < 5: n += 1 if n == 3: continue print(n, end=", ") ``` ```output 1, 2, 4, 5, ``` A `break` statement terminates execution of the loop. ```mojo n = 0 while n < 5: n += 1 if n == 3: break print(n, end=", ") ``` ```output 1, 2, ``` Optionally, a `while` loop can include an `else` clause. The body of the `else` clause executes when the loop's boolean condition evaluates to `False`, even if it occurs the first time tested. ```mojo n = 5 while n < 4: print(n) n += 1 else: print("Loop completed") ``` ```output Loop completed ``` :::note The `else` clause does *not* execute if a `break` or `return` statement exits the `while` loop. ::: ```mojo n = 0 while n < 5: n += 1 if n == 3: break print(n) else: print("Executing else clause") ``` ```output 1 2 ``` ## The `for` statement The `for` loop iterates over a sequence, executing a code block for each element in the sequence. The Mojo `for` loop can iterate over any type that implements an `__iter__()` method that returns a type that defines `__next__()` and `__len__()` methods. ### Iterating over Mojo collections All of the collection types in the [`collections`](/mojo/stdlib/collections) module support `for` loop iteration. See the [Collection types](/mojo/manual/types#collection-types) documentation for more information on Mojo collection types. :::caution TODO Iterating over Mojo native collections currently assigns the loop index variable a [`Pointer`](/mojo/stdlib/memory/pointer/Pointer) to each item, not the item itself. You can access the item using the dereference operator, `[]`, as shown in the examples below. This may change in a future version of Mojo. ::: The following shows an example of iterating over a Mojo [`List`](/mojo/stdlib/collections/list/List). ```mojo from collections import List states = List[String]("California", "Hawaii", "Oregon") for state in states: print(state[]) ``` ```output California Hawaii Oregon ``` The same technique works for iterating over a Mojo [`Set`](/mojo/stdlib/collections/set/Set). ```mojo from collections import Set values = Set[Int](42, 0) for item in values: print(item[]) ``` ```output 42 0 ``` There are two techniques for iterating over a Mojo [`Dict`](/mojo/stdlib/collections/dict/Dict). The first is to iterate directly using the `Dict`, which produces a sequence of the dictionary's keys. ```mojo capitals: Dict[String, String] = { "California": "Sacramento", "Hawaii": "Honolulu", "Oregon": "Salem" } for state in capitals: print(capitals[state[]] + ", " + state[]) ``` ```output Sacramento, California Honolulu, Hawaii Salem, Oregon ``` The second approach to iterating over a Mojo `Dict` is to invoke its [`items()`](/mojo/stdlib/collections/dict/Dict#items) method, which produces a sequence of [`DictEntry`](/mojo/stdlib/collections/dict/DictEntry) objects. Within the loop body, you can then access the `key` and `value` fields of the entry. ```mojo for item in capitals.items(): print(item[].value + ", " + item[].key) ``` ```output Sacramento, California Honolulu, Hawaii Salem, Oregon ``` Another type of iterable provided by the Mojo standard library is a *range*, which is a sequence of integers generated by the [`range()`](/mojo/stdlib/builtin/range/range) function. It differs from the collection types shown above in that it's implemented as a [generator](https://en.wikipedia.org/wiki/Generator_\(computer_programming\)), producing each value as needed rather than materializing the entire sequence in memory. 
Additionally, each value assigned to the loop index variable is simply the `Int` value rather than a `Pointer` to the value, so you should not use the dereference operator on it within the loop. For example: ```mojo for i in range(5): print(i, end=", ") ``` ```output 0, 1, 2, 3, 4, ``` ### `for` loop control statements A `continue` statement skips execution of the rest of the code block and resumes the loop with the next element of the collection. ```mojo for i in range(5): if i == 3: continue print(i, end=", ") ``` ```output 0, 1, 2, 4, ``` A `break` statement terminates execution of the loop. ```mojo for i in range(5): if i == 3: break print(i, end=", ") ``` ```output 0, 1, 2, ``` Optionally, a `for` loop can include an `else` clause. The body of the `else` clause executes after iterating over all of the elements in a collection. ```mojo for i in range(5): print(i, end=", ") else: print("\nFinished executing 'for' loop") ``` ```output 0, 1, 2, 3, 4, Finished executing 'for' loop ``` The `else` clause executes even if the collection is empty. ```mojo from collections import List empty = List[Int]() for i in empty: print(i[]) else: print("Finished executing 'for' loop") ``` ```output Finished executing 'for' loop ``` :::note The `else` clause does *not* execute if a `break` or `return` statement terminates the `for` loop. ::: ```mojo from collections import List animals = List[String]("cat", "aardvark", "hippopotamus", "dog") for animal in animals: if animal[] == "dog": print("Found a dog") break else: print("No dog found") ``` ```output Found a dog ``` ### Iterating over Python collections The Mojo `for` loop supports iterating over Python collection types. Each item retrieved by the loop is a [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) wrapper around the Python object. Refer to the [Python types](/mojo/manual/python/types) documentation for more information on manipulating Python objects from Mojo. The following is a simple example of iterating over a mixed-type Python list. ```mojo from python import Python def main(): # Create a mixed-type Python list py_list = Python.list(42, "cat", 3.14159) for py_obj in py_list: # Each element is of type "PythonObject" print(py_obj) ``` ```output 42 cat 3.14159 ``` :::note TODO Iterating over a Mojo collection currently assigns the loop index variable a `Pointer` to each element, which then requires you to use the dereference operator within the loop body. In contrast, iterating over a Python collection assigns a `PythonObject` wrapper for the element, which does *not* require you to use the dereference operator. ::: There are two techniques for iterating over a Python dictionary. The first is to iterate directly using the dictionary, which produces a sequence of its keys. ```mojo from python import Python def main(): # Create a mixed-type Python dictionary py_dict = Python.evaluate("{'a': 1, 'b': 2.71828, 'c': 'sushi'}") for py_key in py_dict: # Each key is of type "PythonObject" print(py_key, py_dict[py_key]) ``` ```output a 1 b 2.71828 c sushi ``` The second approach to iterating over a Python dictionary is to invoke its `items()` method, which produces a sequence of 2-tuple objects. Within the loop body, you can then access the key and value by index. 
```mojo
from python import Python

def main():
    # Create a mixed-type Python dictionary
    py_dict = Python.evaluate("{'a': 1, 'b': 2.71828, 'c': 'sushi'}")

    for py_tuple in py_dict.items():
        # Each 2-tuple is of type "PythonObject"
        print(py_tuple[0], py_tuple[1])
```

```output
a 1
b 2.71828
c sushi
```

--- ## conv The `conv` module provides classes for performing convolution operations in various dimensions (1D, 2D, and 3D) on tensor inputs. These convolution operations are core building blocks for neural networks, especially in computer vision and sequence processing tasks. Here’s an example demonstrating how to use a 1D convolution:

```python
import max.nn as nn
from max.graph import Graph, ops, Weight, DeviceRef
from max.dtype import DType
import numpy as np

with Graph(name="conv_example") as graph:
    # Define dimensions
    batch_size = 2
    seq_length = 10
    in_channels = 16
    out_channels = 32
    kernel_size = 3

    # Create input tensor [batch_size, sequence_length, channels]
    x_data = np.zeros((batch_size, seq_length, in_channels), dtype=np.float32)
    x = ops.constant(x_data, dtype=DType.float32, device=DeviceRef.CPU())

    # Create weights for convolution
    filter_1d = Weight(
        name="filter_weight",
        dtype=DType.float32,
        shape=[kernel_size, in_channels, out_channels],
        device=DeviceRef.CPU(),
    )
    bias_1d = Weight(
        name="bias_weight",
        dtype=DType.float32,
        shape=[out_channels],
        device=DeviceRef.CPU(),
    )

    # Create and apply Conv1D layer
    conv1d = nn.Conv1D(
        filter=filter_1d,
        bias=bias_1d,
        stride=1,
        padding=1
    )
    output_1d = conv1d(x)

    print(f"Conv1D output shape: {output_1d.shape}")
    # Output: Conv1D output shape: [Dim(2), Dim(10), Dim(32)]
```

## `Conv1D` {#max.nn.conv.Conv1D} > *class* max.nn.conv.Conv1D(kernel\_size, in\_channels, out\_channels, dtype, stride=1, padding=0, dilation=1, num\_groups=1, device=None, has\_bias=False, permute=False, name=None) A 1D convolution over an input signal composed of several input planes. ## Example

```python
conv = nn.Conv1D(
    kernel_size=3,
    in_channels=64,
    out_channels=128,
    dtype=DType.float32,
    stride=1,
    padding=0,
    has_bias=False,
    name="conv1d_weight",
    device=DeviceRef.GPU(),
)
```

Initializes the Conv1D layer with weights and optional bias. **Parameters:** * **kernel\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Size of the convolving kernel. * **in\_channels** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of channels in the input signal. * **out\_channels** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of channels produced by the convolution. * **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The data type for both weights and bias. * **stride** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Stride of the convolution. Default: 1 * **padding** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Padding added to both sides of the input. Default: 0 * **dilation** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Spacing between kernel elements. Default: 1 * **num\_groups** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of blocked connections from input channels to output channels. Default: 1 * **device** (`DeviceRef` `|` `None` ) – The target device for computation. Weights remain on CPU until moved during computation. * **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) – Base name for weights (appended with `.weight` and `.bias` if applicable). * **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – When [`True`](https://docs.python.org/3/library/constants.html#True), adds a bias vector to the layer. Defaults to [`False`](https://docs.python.org/3/library/constants.html#False). * **permute** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) ### `bias` {#max.nn.conv.Conv1D.bias} > bias\*: [Weight](../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The optional bias vector stored on CPU with shape (out\_channels,). Model init moves the bias to [`device`](#max.nn.conv.Conv1D.device) if present. ### `device` {#max.nn.conv.Conv1D.device} > device\*: DeviceRef | [None](https://docs.python.org/3/library/constants.html#None)\* The device where matrix operations are performed. ### `dilation` {#max.nn.conv.Conv1D.dilation} > dilation\*: [int](https://docs.python.org/3/library/functions.html#int)\* Controls the dilation rate. ### `filter` {#max.nn.conv.Conv1D.filter} > filter\*: [Weight](../graph/Weight.md#max.graph.Weight)\* The weight matrix stored on CPU with shape (kernel\_size, in\_channels / num\_groups, out\_channels). Model init moves the weight to [`device`](#max.nn.conv.Conv1D.device). ### `num_groups` {#max.nn.conv.Conv1D.num_groups} > num\_groups\*: [int](https://docs.python.org/3/library/functions.html#int)\* Number of blocked connections from input channels to output channels. ### `padding` {#max.nn.conv.Conv1D.padding} > padding\*: [int](https://docs.python.org/3/library/functions.html#int)\* Controls the amount of padding applied before and after the input. ### `permute` {#max.nn.conv.Conv1D.permute} > permute\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* Controls whether `self.filter` is permuted from PyTorch order to MAX order. PyTorch order is: (out\_channels, in\_channels / num\_groups, kernel\_size). MAX API order: (kernel\_size, in\_channels / num\_groups, out\_channels). ### `stride` {#max.nn.conv.Conv1D.stride} > stride\*: [int](https://docs.python.org/3/library/functions.html#int)\* Controls the stride for the cross-correlation. ## `Conv1DV1` {#max.nn.conv.Conv1DV1} > *class* max.nn.conv.Conv1DV1(filter, bias=None, stride=1, padding=0, dilation=1, groups=1) A 1D convolution over an input signal composed of several input planes. Deprecated: Use Conv1D instead.
## Example ```python conv = nn.Conv1DV1( filter=filter_1d, bias=bias_1d, stride=1, padding=1 ) ``` **Parameters:** * **filter** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) * **stride** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **padding** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **dilation** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **groups** ([`int`](https://docs.python.org/3/library/functions.html#int) ) ### `bias` {#max.nn.conv.Conv1DV1.bias} > bias\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* ### `dilation` {#max.nn.conv.Conv1DV1.dilation} > dilation\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 1* ### `filter` {#max.nn.conv.Conv1DV1.filter} > filter\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `groups` {#max.nn.conv.Conv1DV1.groups} > groups\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 1* ### `padding` {#max.nn.conv.Conv1DV1.padding} > padding\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 0* ### `stride` {#max.nn.conv.Conv1DV1.stride} > stride\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 1* ## `Conv2DV1` 
{#max.nn.conv.Conv2DV1} > *class* max.nn.conv.Conv2DV1(filter, bias=None, stride=(1, 1), padding=(0, 0, 0, 0), dilation=(1, 1), groups=1) A 2D convolution over an input signal composed of several input planes. ## Example ```python conv = nn.Conv2DV1( filter=filter_2d, bias=bias_2d, stride=2, padding=1 ) output = conv(x) ``` **Parameters:** * **filter** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) * **stride** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **padding** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **dilation** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **groups** ([`int`](https://docs.python.org/3/library/functions.html#int) ) ### `bias` {#max.nn.conv.Conv2DV1.bias} > bias\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* ### `dilation` {#max.nn.conv.Conv2DV1.dilation} > dilation\*: [int](https://docs.python.org/3/library/functions.html#int) | 
[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* *= (1, 1)* ### `filter` {#max.nn.conv.Conv2DV1.filter} > filter\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `groups` {#max.nn.conv.Conv2DV1.groups} > groups\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 1* ### `padding` {#max.nn.conv.Conv2DV1.padding} > padding\*: [int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* *= (0, 0, 0, 0)* ### `stride` {#max.nn.conv.Conv2DV1.stride} > stride\*: [int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* *= (1, 1)* ## `Conv3D` {#max.nn.conv.Conv3D} > *class* max.nn.conv.Conv3D(depth, height, width, in\_channels, out\_channels, dtype, stride=1, padding=0, dilation=1, num\_groups=1, device=None, has\_bias=False, permute=False, name=None) A 3D convolution over an input signal composed of several input planes. ## Example

```python
# The kernel and channel sizes below are placeholder values for illustration;
# substitute your own.
conv = nn.Conv3D(
    depth=3,
    height=3,
    width=3,
    in_channels=64,
    out_channels=128,
    dtype=DType.float32,
    stride=1,
    padding=0,
    has_bias=False,
    name="conv3d_weight",
    device=DeviceRef.GPU(),
)
```

Initializes the Conv3D layer with weights and optional bias. **Parameters:** * **depth** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – kernel\_size\[0] * **height** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – kernel\_size\[1] * **width** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – kernel\_size\[2] * **in\_channels** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – number of channels in the input image. * **out\_channels** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – dimensionality of the output. * **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The data type for both weights and bias. * **stride** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – Stride of the convolution.
Default: 1 * **padding** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – Padding added to all six sides of the input. Default: 0 * **dilation** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – Spacing between kernel elements. Default: 1 * **num\_groups** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of blocked connections from input channels to output channels. Default: 1. * **device** (`DeviceRef` `|` `None` ) – The target device for computation. Weights remain on CPU until moved during computation. * **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) – Base name for weights (appended with `.weight` and `.bias` if applicable). * **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – When [`True`](https://docs.python.org/3/library/constants.html#True), adds a bias vector to the layer. Defaults to [`False`](https://docs.python.org/3/library/constants.html#False). * **permute** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) ### `bias` {#max.nn.conv.Conv3D.bias} > bias\*: [Weight](../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The optional bias vector stored on CPU with shape (out\_channels,). Model init moves the bias to [`device`](#max.nn.conv.Conv3D.device) if present. ### `device` {#max.nn.conv.Conv3D.device} > device\*: DeviceRef | [None](https://docs.python.org/3/library/constants.html#None)\* The device where matrix operations are performed. ### `dilation` {#max.nn.conv.Conv3D.dilation} > dilation\*: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* Not implemented yet. Assuming dilation = 1 for now. ### `filter` {#max.nn.conv.Conv3D.filter} > filter\*: [Weight](../graph/Weight.md#max.graph.Weight)\* The weight matrix stored on CPU with shape (depth, height, width, in\_channels / num\_groups, out\_channels). Model init moves the weight to [`device`](#max.nn.conv.Conv3D.device). ### `num_groups` {#max.nn.conv.Conv3D.num_groups} > num\_groups\*: [int](https://docs.python.org/3/library/functions.html#int)\* Not implemented yet. Assuming num\_groups = 1 for now. 
### `padding` {#max.nn.conv.Conv3D.padding} > padding\*: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* Controls the amount of padding applied before and after the input for the depth, height, and width dimensions. ### `permute` {#max.nn.conv.Conv3D.permute} > permute\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* Controls whether `self.filter` is permuted from PyTorch order to MAX API order. PyTorch order: (out\_channels, in\_channels / num\_groups, depth, height, width). MAX API order: (depth, height, width, in\_channels / num\_groups, out\_channels). ### `stride` {#max.nn.conv.Conv3D.stride} > stride\*: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* Controls the stride for the cross-correlation. ## `Conv3DV1` {#max.nn.conv.Conv3DV1} > *class* max.nn.conv.Conv3DV1(filter, bias=None, stride=(1, 1, 1), padding=(0, 0, 0, 0, 0, 0), dilation=(1, 1, 1), groups=1) A 3D convolution over an input signal composed of several input planes. Deprecated: Use Conv3D instead. ## Example

```python
conv = nn.Conv3DV1(
    filter=filter_3d,
    bias=bias_3d,
    stride=1,
    padding=1
)
```

**Parameters:** * **filter** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) * **stride** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **padding** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,`
[`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **dilation** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **groups** ([`int`](https://docs.python.org/3/library/functions.html#int) ) ### `bias` {#max.nn.conv.Conv3DV1.bias} > bias\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* ### `dilation` {#max.nn.conv.Conv3DV1.dilation} > dilation\*: [int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* *= (1, 1, 1)* ### `filter` {#max.nn.conv.Conv3DV1.filter} > filter\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `groups` {#max.nn.conv.Conv3DV1.groups} > groups\*: [int](https://docs.python.org/3/library/functions.html#int)\* *= 1* ### `padding` {#max.nn.conv.Conv3DV1.padding} > padding\*: [int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* *= (0, 0, 0, 0, 0, 0)* ### `stride` {#max.nn.conv.Conv3DV1.stride} > stride\*: [int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]\* *= (1, 1, 
1)* --- ## conv ## Structs * [​`ConvDirectNHWC`](./ConvDirectNHWC): Implement the outer loops for direct convolution. Collapse N, HO, WO into one dimension n\_ho\_wo. Tile n\_ho\_wo, C, and F. The tile factors for C and F are chosen by a heuristic that prioritizes C. n\_ho\_wo is tiled by the micro kernel's height. * [​`Naive2dConvolution`](./Naive2dConvolution): Struct wrapper for the naive 2d convolution implementation. ## Functions * [​`accumulate_wo_tile_1d`](./accumulate_wo_tile_1d): Update one row in the output for a given (c, f) tile. * [​`accumulate_wo_tile_2d`](./accumulate_wo_tile_2d): * [​`accumulate_wo_tile_3d`](./accumulate_wo_tile_3d): * [​`check_cudnn_error`](./check_cudnn_error): * [​`conv1d_update_wo_tile`](./conv1d_update_wo_tile): * [​`conv2d_gpu_naive_nhwc_rscf`](./conv2d_gpu_naive_nhwc_rscf): * [​`conv2d_update_wo_tile`](./conv2d_update_wo_tile): * [​`conv3d_gpu_naive_ndhwc_qrscf`](./conv3d_gpu_naive_ndhwc_qrscf): * [​`conv3d_update_wo_tile`](./conv3d_update_wo_tile): * [​`conv_cudnn`](./conv_cudnn): * [​`conv_gpu`](./conv_gpu): * [​`conv_nhwc_direct`](./conv_nhwc_direct): * [​`conv_shape`](./conv_shape): Compute the output shape of a `conv` operation, and assert the inputs are compatible. * [​`pack_conv_filter_shape`](./pack_conv_filter_shape): Compute the output shape of convolution filter packing. * [​`pack_filter`](./pack_filter): Packs the filter from RSCF to FRSCf layout (see the sketch after this list). Uses the default micro kernel size for dynamic shapes. * [​`pack_filter_shape`](./pack_filter_shape): Compute the shape of the packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel. * [​`pack_filter_shape_impl`](./pack_filter_shape_impl): Compute the shape of the packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel.
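As a rough illustration of the FRSCf packing referenced above, the sketch below maps an RSCF coordinate to its packed offset. The `frscf_offset` helper and the inner tile size `f` are hypothetical illustrations, assuming `f` divides the filter count F; the real kernels derive the tile from the micro-kernel width.

```mojo
# Hypothetical sketch of the RSCF -> FRSCf index mapping; assumes the inner
# tile size f divides the filter count F. Not the library's implementation.
fn frscf_offset(
    r: Int, s: Int, c: Int, f_idx: Int, R: Int, S: Int, C: Int, f: Int
) -> Int:
    var f_outer = f_idx // f  # which block of f contiguous filters
    var f_inner = f_idx % f   # position within that block
    # Linearize the coordinate in (F/f, R, S, C, f) order.
    return (((f_outer * R + r) * S + s) * C + c) * f + f_inner

fn main():
    # Filter element (r=1, s=2, c=3, f_idx=5) in a 3x3x4x8 filter, inner tile f=4.
    print(frscf_offset(1, 2, 3, 5, 3, 3, 4, 4))
```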
--- ## conv_cudnn `conv_cudnn[input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType](input: UnsafePointer[SIMD[input_type, 1]], filter: UnsafePointer[SIMD[filter_type, 1]], output: UnsafePointer[SIMD[output_type, 1]], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2], num_groups: Int, ctx: DeviceContext)` --- ## conv_gpu `conv_gpu[input_rank: Int, filter_rank: Int, input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]({:i1 0, 1})](input: NDBuffer[input_type, input_rank, MutableAnyOrigin, input_dim], filter: NDBuffer[filter_type, filter_rank, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, input_rank, MutableAnyOrigin, output_dim], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], padding: IndexList[(input_rank + -2)], num_groups: Int, ctx: DeviceContext)` --- ## conv_nhwc_direct `conv_nhwc_direct[input_rank: Int, filter_rank: Int, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, conv_info_static: ConvInfoStatic[(input_rank + -2)], lambdas_have_fusion: Bool, elementwise_lambda: fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None](input: NDBuffer[input_type, input_rank, origin, input_shape], filter: NDBuffer[filter_type, filter_rank, origin, filter_shape], output: NDBuffer[output_type, input_rank, origin, output_shape], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], num_groups: Int)` --- ## conv_shape `conv_shape[input_rank: Int, filter_rank: Int, input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], filter_buf: NDBuffer[filter_type, filter_rank, origin], strides_buf: NDBuffer[strides_type, 1, origin], dilations_buf: NDBuffer[dilations_type, 1, origin], paddings_buf: NDBuffer[paddings_type, 1, origin], num_groups_scalar: SIMD[dtype, 1]) -> IndexList[input_rank]` Compute the output shape of a `conv` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​filter\_rank (`Int`): Rank of the filter tensor. * ​input\_type (`DType`): Type of the input tensor. * ​filter\_type (`DType`): Type of the filter tensor. * ​strides\_type (`DType`): Type of the strides tensor. * ​dilations\_type (`DType`): Type of the dilations tensor. * ​paddings\_type (`DType`): Type of the paddings tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​filter\_buf (`NDBuffer[filter_type, filter_rank, origin]`): The filter tensor. * ​strides\_buf (`NDBuffer[strides_type, 1, origin]`): The strides tensor. * ​dilations\_buf (`NDBuffer[dilations_type, 1, origin]`): The dilations tensor. * ​paddings\_buf (`NDBuffer[paddings_type, 1, origin]`): The paddings tensor. * ​num\_groups\_scalar (`SIMD[dtype, 1]`): The num\_groups scalar. **Returns:** The output shape.
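For intuition, the per-dimension arithmetic behind a convolution output shape follows the standard formula; the sketch below is a plain Mojo restatement of that formula, not the implementation of `conv_shape` itself.

```mojo
# Standard per-dimension convolution output size; a sketch of the arithmetic,
# not the Mojo implementation of conv_shape.
fn conv_output_dim(
    in_dim: Int, kernel: Int, stride: Int,
    dilation: Int, pad_before: Int, pad_after: Int,
) -> Int:
    var effective_kernel = dilation * (kernel - 1) + 1
    return (in_dim + pad_before + pad_after - effective_kernel) // stride + 1

fn main():
    # A 224-wide input with a 3-wide kernel, stride 2, and 1 pixel of padding
    # on each side produces a 112-wide output.
    print(conv_output_dim(224, 3, 2, 1, 1, 1))  # 112
```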
--- ## conv_transpose ## Structs * [​`ConvTransposedPacked`](./ConvTransposedPacked): ## Functions * [​`accumulate_wo_tile`](./accumulate_wo_tile): * [​`conv_transpose_naive`](./conv_transpose_naive): Implements the ConvTranspose operator from the MO spec. * [​`conv_transpose_shape`](./conv_transpose_shape): Compute the output shape of a `conv-transpose` operation, and assert the inputs are compatible. * [​`conv_transposed`](./conv_transposed): * [​`get_num_partitions`](./get_num_partitions): Partition the workload in (batch\&group, C, F, H) dimensions. HOWO is the combination of the HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. * [​`get_partition`](./get_partition): * [​`pack_filter`](./pack_filter): Packs the filter from RSFC to FRSCf. * [​`pack_filter_shape`](./pack_filter_shape): Compute the output shape of transposed convolution filter packing. * [​`update_w_tile_2d`](./update_w_tile_2d): * [​`update_w_tile_3d`](./update_w_tile_3d): --- ## conv_transpose_naive `conv_transpose_naive[type: DType](output: NDBuffer[type, 5, MutableAnyOrigin], input: NDBuffer[type, 5, MutableAnyOrigin], filter: NDBuffer[type, 5, MutableAnyOrigin], stride: IndexList[3], dilation: IndexList[3], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2])` Implements the ConvTranspose operator from the MO spec. **Parameters:** * ​type (`DType`): Type of the input, output, and kernel tensors. **Args:** * ​output (`NDBuffer[type, 5, MutableAnyOrigin]`): Output data tensor that contains the result of the convolution. * ​input (`NDBuffer[type, 5, MutableAnyOrigin]`): Input data tensor from the previous layer, with size (N x H x W x C), where N is the batch size, C is the number of channels, and H and W are the height and width. * ​filter (`NDBuffer[type, 5, MutableAnyOrigin]`): The weight (kernel) tensor, with size (kH x kW x M/groups x C), where C is the number of channels, kH and kW are the height and width of the kernel, and M is the number of feature maps. * ​stride (`IndexList[3]`): Stride along each spatial axis. * ​dilation (`IndexList[3]`): Dilation value along each spatial axis of the filter. * ​pad\_d (`IndexList[2]`): Padding in the depth dimension. * ​pad\_h (`IndexList[2]`): Padding in the height dimension. * ​pad\_w (`IndexList[2]`): Padding in the width dimension. --- ## conv_transpose_shape `conv_transpose_shape[input_rank: Int, kernel_rank: Int, type: DType, strides_type: DType, dilations_type: DType, pads_type: DType, output_pads_type: DType, single_thread_blocking_override: Bool](input: NDBuffer[type, input_rank, origin], kernel: NDBuffer[type, kernel_rank, origin], strides: NDBuffer[strides_type, 1, origin], dilations: NDBuffer[dilations_type, 1, origin], pads: NDBuffer[pads_type, 1, origin], output_pads: NDBuffer[output_pads_type, 1, origin]) -> IndexList[input_rank]` Compute the output shape of a `conv-transpose` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​kernel\_rank (`Int`): Rank of the kernel tensor. * ​type (`DType`): Element type of the input and kernel tensor. * ​strides\_type (`DType`): Element type of the strides tensor. * ​dilations\_type (`DType`): Element type of the dilations tensor. * ​pads\_type (`DType`): Element type of the pads tensor. * ​output\_pads\_type (`DType`): Element type of the output\_pads tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.
**Args:** * ​input (`NDBuffer[type, input_rank, origin]`): The input tensor. * ​kernel (`NDBuffer[type, kernel_rank, origin]`): The kernel tensor. * ​strides (`NDBuffer[strides_type, 1, origin]`): The strides tensor. * ​dilations (`NDBuffer[dilations_type, 1, origin]`): The dilations tensor. * ​pads (`NDBuffer[pads_type, 1, origin]`): The paddings tensor. * ​output\_pads (`NDBuffer[output_pads_type, 1, origin]`): The output paddings tensor. **Returns:** The output shape. --- ## conv_transposed `conv_transposed[input_rank: Int, filter_rank: Int, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, lambdas_have_fusion: Bool, elementwise_lambda: fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None](output: NDBuffer[output_type, input_rank, origin, output_shape], input: NDBuffer[input_type, input_rank, origin, input_shape], filter: NDBuffer[filter_type, filter_rank, origin, filter_shape], stride: IndexList[(input_rank + -2)], dilation: IndexList[(input_rank + -2)], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2])` --- ## conv_utils ## Aliases ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None` ### `elementwise_simd_epilogue_type` `alias elementwise_simd_epilogue_type = fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None` ## Structs * [​`ConvAlgorithm`](./ConvAlgorithm): * [​`ConvInfoStatic`](./ConvInfoStatic): * [​`ConvPartition`](./ConvPartition): Work range for a partition. * [​`ConvShape`](./ConvShape): A shape struct describing the convolution dimensions. ## Functions * [​`align_down_residual`](./align_down_residual): Returns the remainder after aligning down value to alignment. * [​`append_shape`](./append_shape): Append input shape by inserting `last2nd` and `last` at the end. * [​`extend_shape`](./extend_shape): Extend input shape by inserting `first` and `last` at both ends. * [​`get_conv2d_shape`](./get_conv2d_shape): * [​`get_conv_num_partitions`](./get_conv_num_partitions): Partition the workload in (batch, C, F, HOWO) dimensions. HOWO is the combination of the HO and WO dimensions. The actual number of tasks is the product of the returned num\_partitions. * [​`get_conv_num_tasks`](./get_conv_num_tasks): * [​`get_conv_shape`](./get_conv_shape): * [​`get_conv_tile_shape`](./get_conv_tile_shape): Compute the (c, f) tile shape in L2. Assuming NHWC layout, the tile shape is (R, S, c\_tile, f\_tile). R and S are by default fully covered. The heuristic tries to block in C as much as possible; if C is small, it starts to block F.
* [​`get_conv_tile_size`](./get_conv_tile_size): * [​`get_direct_conv_micro_kernel_height`](./get_direct_conv_micro_kernel_height): * [​`get_direct_conv_micro_kernel_width`](./get_direct_conv_micro_kernel_width): * [​`get_micro_kernel_shape`](./get_micro_kernel_shape): * [​`get_partition`](./get_partition): * [​`reorder_padding`](./reorder_padding): --- ## conv1d_update_wo_tile `conv1d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[rank], n: Int, wo: Int)` --- ## conv2d_gpu_naive_nhwc_rscf `conv2d_gpu_naive_nhwc_rscf[input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, block_size: Int, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](input: NDBuffer[input_type, 4, MutableAnyOrigin, input_dim], filter: NDBuffer[filter_type, 4, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, 4, MutableAnyOrigin, output_dim], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2])` --- ## conv2d_update_wo_tile `conv2d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[2], n: Int, howo: IndexList[2])` --- ## conv3d_gpu_naive_ndhwc_qrscf `conv3d_gpu_naive_ndhwc_qrscf[input_dim: DimList, filter_dim: DimList, output_dim: DimList, input_type: DType, filter_type: DType, output_type: DType, block_size: Int, maybe_epilogue_func: OptionalReg[fn[DType, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](input: NDBuffer[input_type, 5, MutableAnyOrigin, input_dim], filter: NDBuffer[filter_type, 5, MutableAnyOrigin, filter_dim], output: NDBuffer[output_type, 5, MutableAnyOrigin, output_dim], stride: IndexList[3], dilation: IndexList[3], padding: IndexList[3])` --- ## conv3d_update_wo_tile `conv3d_update_wo_tile[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, filter_packed: Bool, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType, elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], first_c_tile: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[3], 
n: Int, dohowo: IndexList[3])` --- ## ConvAlgorithm `@register_passable(trivial)` `struct ConvAlgorithm` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Default` `alias Default = ConvAlgorithm(0)` ### `Direct` `alias Direct = ConvAlgorithm(2)` ### `Im2Col` `alias Im2Col = ConvAlgorithm(1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## ConvDirectNHWC `struct ConvDirectNHWC[input_mut: Bool, filter_mut: Bool, //, input_rank: Int, filter_rank: Int, output_rank: Int, input_origin: Origin[input_mut], filter_origin: Origin[filter_mut], output_origin: MutableOrigin, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, filter_packed: Bool, conv_attr: ConvInfoStatic[(input_rank + -2)], elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})]` Implement the outer loops for direct convolution. Collapse N, HO, WO into one dimension n\_ho\_wo. Tile n\_ho\_wo, C, and F. The tile factors for C and F are chosen by a heuristic that prioritizes C. n\_ho\_wo is tiled by the micro kernel's height. If n\_ho\_wo is large enough to spill the LLC, we may need to tile n\_ho\_wo as the outermost loop with a factor that fits in the LLC. Assume F is divisible at least by simd\_size. ## Fields * ​output (`NDBuffer[output_type, output_rank, output_origin, output_shape]`): * ​input (`NDBuffer[input_type, input_rank, input_origin, input_shape]`): * ​filter (`NDBuffer[filter_type, filter_rank, filter_origin, filter_shape]`): * ​conv\_shape (`ConvShape[(input_rank + -2)]`): * ​partition (`ConvPartition`): * ​cf\_tile\_size (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `packed_and_fully_static` `alias packed_and_fully_static = filter_packed if filter_shape.all_known[::Int]() if output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else filter_shape.all_known[::Int]() if output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else output_shape.all_known[::Int,::Int]() if input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known() else input_shape.all_known[::Int,::Int]() if conv_attr.all_known() else conv_attr.all_known()` ## Methods ### `run` `static run(output: NDBuffer[output_type, output_rank, output_origin, output_shape], input: NDBuffer[input_type, input_rank, input_origin, input_shape], filter: NDBuffer[filter_type, filter_rank, filter_origin, filter_shape], conv_shape: ConvShape[(input_rank + -2)])` ### `is_new_c_accum` `is_new_c_accum(self, c_idx: Int) -> Bool` ### `update_output_tile_no_padding` `update_output_tile_no_padding[micro_kernel_height: Int, micro_kernel_width: Int, c_fully_cached: Bool,
has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int, output_flat_coord: Int)` ### `output_space_flat_loop` `output_space_flat_loop[micro_kernel_f_size: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `output_space_loop` `output_space_loop[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `output_space_loop_1d` `output_space_loop_1d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `output_space_loop_2d` `output_space_loop_2d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `output_space_loop_3d` `output_space_loop_3d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` --- ## ConvertibleFromPython Denotes a type that can attempt construction from a read-only Python object. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self: _Self, obj: PythonObject)` Attempt to construct an instance of this object from a read-only Python value. **Args:** * ​obj (`PythonObject`): The Python object to convert from. **Raises:** If conversion was not successful. ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. 
--- ## ConvInfoStatic `struct ConvInfoStatic[rank: Int]` ## Fields * ​pad (`DimList`): * ​stride (`DimList`): * ​dilation (`DimList`): * ​num\_groups (`Dim`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` `__init__(out self, pad: DimList, stride: DimList, dilation: DimList, num_groups: Dim)` `__init__(out self, pad: DimList, stride: DimList, dilation: DimList, input_c: Dim, filter_c: Dim)` ### `all_known` `all_known(self) -> Bool` ### `pad_left` `pad_left(self) -> Int` ### `pad_bottom` `pad_bottom(self) -> Int` ### `strides` `strides(self) -> IndexList[2]` ### `dilations` `dilations(self) -> IndexList[2]` --- ## ConvPartition `@register_passable(trivial)` `struct ConvPartition` Work range for a partition. ## Fields * ​ng\_offset (`Int`): * ​ng\_size (`Int`): * ​f\_offset (`Int`): * ​f\_size (`Int`): * ​ho\_or\_howo\_offset (`Int`): * ​ho\_or\_howo\_size (`Int`): * ​c\_offset (`Int`): * ​c\_size (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `empty` `empty(self) -> Bool` --- ## ConvShape `@register_passable(trivial)` `struct ConvShape[rank: Int]` A shape struct describing the convolution dimensions. ## Fields * ​n (`Int`): * ​input\_dims (`IndexList[rank]`): * ​output\_dims (`IndexList[rank]`): * ​filter\_dims (`IndexList[rank]`): * ​c (`Int`): * ​f (`Int`): * ​stride (`IndexList[rank]`): * ​dilation (`IndexList[rank]`): * ​pad\_d (`IndexList[2]`): * ​pad\_h (`IndexList[2]`): * ​pad\_w (`IndexList[2]`): * ​num\_groups (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `d` `d(self) -> Int` Input depth. ### `h` `h(self) -> Int` Input height. ### `w` `w(self) -> Int` Input width. ### `do` `do(self) -> Int` Output depth. ### `ho` `ho(self) -> Int` Output height. ### `wo` `wo(self) -> Int` Output width. ### `q` `q(self) -> Int` Filter window depth. ### `r` `r(self) -> Int` Filter window height. ### `s` `s(self) -> Int` Filter window width. ### `filter_window_flat_size` `filter_window_flat_size(self) -> Int` ### `input_image_flat_size` `input_image_flat_size(self) -> Int` ### `output_image_flat_size` `output_image_flat_size(self) -> Int` ### `output_space_dims` `output_space_dims(self) -> IndexList[rank]` ### `output_flat_coord_to_input_offset` `output_flat_coord_to_input_offset(self, n: Int, output_flat_coord: Int) -> Int` ### `matmul_M` `matmul_M(self) -> Int` ### `matmul_N` `matmul_N(self) -> Int` ### `matmul_K` `matmul_K(self) -> Int` ### `padded` `padded(self) -> Bool` ### `c_per_group` `c_per_group(self) -> Int` Returns the number of channels per group. Channel count must be divisible by group size. ### `f_per_group` `f_per_group(self) -> Int` Returns the number of filters per group. Filter count must be divisible by group size. ### `f_to_group` `f_to_group(self, f_idx: Int) -> Int` Given a global filter idx, returns the group idx of the group the filter belongs to. ### `c_to_group` `c_to_group(self, c_idx: Int) -> Int` Given a global channel idx, returns the group idx of the group the channel belongs to. ### `f_in_group` `f_in_group(self, f_idx: Int) -> Int` Given a global filter idx, returns the offset of the filter in its group. ### `c_in_group` `c_in_group(self, c_idx: Int) -> Int` Given a global channel idx, returns the offset of the channel in its group.
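The group-index helpers above reduce to simple integer arithmetic. The standalone functions below are a hypothetical sketch of that arithmetic, not `ConvShape`'s own methods.

```mojo
# Sketch of grouped-convolution index arithmetic; hypothetical standalone
# helpers, not ConvShape's methods.
fn f_per_group(f: Int, num_groups: Int) -> Int:
    # Filters per group; f must be divisible by num_groups.
    return f // num_groups

fn f_to_group(f_idx: Int, f: Int, num_groups: Int) -> Int:
    # Group that global filter index f_idx belongs to.
    return f_idx // f_per_group(f, num_groups)

fn f_in_group(f_idx: Int, f: Int, num_groups: Int) -> Int:
    # Offset of global filter index f_idx within its group.
    return f_idx % f_per_group(f, num_groups)

fn main():
    # 16 filters in 4 groups of 4: filter 9 is in group 2 at offset 1.
    print(f_to_group(9, 16, 4), f_in_group(9, 16, 4))
```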
--- ## ConvTransposedPacked `struct ConvTransposedPacked[input_mut: Bool, filter_mut: Bool, //, input_rank: Int, filter_rank: Int, output_rank: Int, input_origin: Origin[input_mut], filter_origin: Origin[filter_mut], output_origin: MutableOrigin, input_shape: DimList, filter_shape: DimList, output_shape: DimList, input_type: DType, filter_type: DType, output_type: DType, conv_attr: ConvInfoStatic[(input_rank + -2)], elementwise_epilogue: OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None] = OptionalReg[fn[Int](coords: IndexList[$0], f_size: Int) capturing -> None]({:i1 0, 1})]` ## Fields * ​output (`NDBuffer[output_type, output_rank, output_origin, output_shape]`): * ​input (`NDBuffer[input_type, input_rank, input_origin, input_shape]`): * ​filter (`NDBuffer[filter_type, filter_rank, filter_origin, filter_shape]`): * ​conv\_shape (`ConvShape[(input_rank + -2)]`): * ​partition (`ConvPartition`): * ​cf\_tile\_size (`IndexList[2]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `run` `static run(output: NDBuffer[output_type, output_rank, output_origin, output_shape], input: NDBuffer[input_type, input_rank, input_origin, input_shape], filter: NDBuffer[filter_type, filter_rank, filter_origin, filter_shape], conv_shape: ConvShape[(input_rank + -2)])` ### `input_space_loop` `input_space_loop[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool](self, n: Int, f_tile_offset: Int, f_tile_size: Int, c_tile_offset: Int, c_tile_size: Int)` ### `input_space_loop_2d` `input_space_loop_2d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `input_space_loop_3d` `input_space_loop_3d[micro_kernel_height: Int, micro_kernel_width: Int, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](self, output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], n: Int, first_c_tile_in_group: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, left_pad_impact_end: Int, right_pad_impact_start: Int)` ### `apply_epilogue` `apply_epilogue(self, n: Int, g: Int)` --- ## coord_transform `coord_transform[mode: CoordinateTransformationMode](out_coord: Int, in_dim: Int, out_dim: Int, scale: SIMD[float32, 1]) -> SIMD[float32, 1]` --- ## CoordinateTransformationMode `struct CoordinateTransformationMode` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `AlignCorners` `alias AlignCorners = CoordinateTransformationMode(1)` ### `Asymmetric` `alias Asymmetric = CoordinateTransformationMode(2)` ### `HalfPixel` `alias HalfPixel = CoordinateTransformationMode(0)` ### `HalfPixel1D` `alias HalfPixel1D = CoordinateTransformationMode(3)` ## Methods ### `__init__` `@implicit` `__init__(out self, value: Int)` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## copy `copy[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), thread_scope: ThreadScope = ThreadScope(0), row_major: Bool = False](dst: 
LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from local memory (registers) to SRAM (shared memory). This function performs a synchronous copy operation from register memory to shared memory in a GPU context, distributing the workload across multiple threads for parallel execution. It's particularly useful for transferring processed data from registers to shared memory for inter-thread communication. Example: ```mojo from layout import LayoutTensor, Layout var register_data = LayoutTensor[DType.float32, Layout((16, 16)), address_space=AddressSpace.LOCAL]() var shared_data = LayoutTensor[DType.float32, Layout((16, 16)), address_space=AddressSpace.SHARED]() # Process data in registers # ... # Copy processed data to shared memory for inter-thread communication copy[Layout((8, 8))](shared_data, register_data) ``` Performance: * Distributes the copy workload across multiple threads for parallel execution. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Optimized for transferring data from registers to shared memory. * On AMD GPUs, the `row_major` parameter can be used to match the memory access pattern used during prefetching from DRAM to registers. Notes: * The destination tensor must be in `SHARED` address space (SRAM). * The source tensor must be in `LOCAL` address space (registers). * This function is particularly useful in GPU kernels for sharing processed data between threads in the same block. * The `row_major` parameter is specifically designed for AMD GPUs when using a prefetching pattern from DRAM to SRAM via registers. **Constraints:** * Destination tensor must be in SHARED address space. * Source tensor must be in LOCAL address space. * For optimal performance, the thread layout should match the memory access patterns of the tensors. **Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for the operation. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. * ​row\_major (`Bool`): Whether to use row-major ordering for the copy operation. This is particularly relevant when prefetching from DRAM to SRAM via registers on AMD GPUs. Defaults to False. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in local memory (registers). 
--- ## copy_dram_to_local `copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], offset: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}))` Efficiently copy data from global memory (DRAM) to registers for AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * The offset calculation method significantly impacts performance. Current implementation optimizes for throughput over flexibility. * This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency. **Constraints:** * Only supported on AMD GPUs. * The destination element layout size must match the SIMD width. * Source fragments must be rank 2 with known dimensions. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in global memory (DRAM) to be copied. * ​src\_base (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The original global memory tensor from which src is derived. This is used to construct the buffer descriptor required by AMD's `buffer_load` intrinsic. * ​offset (`OptionalReg[UInt]`): The offset in the global memory. 
`copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bounds: SIMD[uint32, 1])` Efficiently copy data from global memory (DRAM) to registers for AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_load intrinsic to efficiently transfer data from global memory to registers while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * The offset calculation method significantly impacts performance. Current implementation optimizes for throughput over flexibility. * This function is particularly useful for prefetching data into registers before performing computations, reducing memory access latency. **Constraints:** * Only supported on AMD GPUs. * The destination element layout size must match the SIMD width. * Source fragments must be rank 2 with known dimensions. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator. * ​bounds (`SIMD[uint32, 1]`): Bounds of the buffer, based on the ptr of the src\_iter. `copy_dram_to_local[src_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from global memory (DRAM) to registers. This function implements an optimized memory transfer operation from global memory to register memory. It distributes the copy operation across multiple threads for maximum throughput while handling bounds checking for safety. **Constraints:** * The source tensor must be in GLOBAL address space (DRAM). * The destination tensor must be in LOCAL address space (registers). * Both tensors must have compatible data types. 
**Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in register memory (LOCAL address space). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in global memory (DRAM). --- ## copy_dram_to_sram `copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context. This function performs a synchronous copy operation from global memory (DRAM) to shared memory (SRAM) in a GPU context, distributing the workload across multiple threads for parallel execution. It uses thread affinity mapping to ensure efficient work distribution and supports vectorized memory operations for optimal performance. Example: ```mojo from layout import LayoutTensor, Layout var global_data = LayoutTensor[DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL]() var shared_data = LayoutTensor[DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED]() # Copy data using a 2D thread layout with 8x8 threads copy_dram_to_sram[Layout((8, 8))](shared_data, global_data) ``` Performance: * Distributes the copy workload across multiple threads for parallel execution. * Supports vectorized loads and stores for better memory throughput. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Thread affinity mapping ensures efficient work distribution. * For masked tensors, performs bounds checking to handle edge cases correctly. Notes: * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. * This function is synchronous, meaning all threads must complete their copy operations before proceeding. * For optimal performance, the thread layouts should match the memory access patterns of the tensors. * This function is particularly useful in GPU kernels for loading data from global memory to shared memory for faster access. **Constraints:** * Source and destination tensors must have the same data type. * Source tensor must be in GENERIC or GLOBAL address space. 
* Destination tensor must be in SHARED address space. * For non-masked tensors, the fragment sizes must match. **Parameters:** * ​src\_thread\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. * ​dst\_thread\_layout (`Layout`): Layout defining how threads are organized for the destination tensor. Defaults to the same as `src_thread_layout` if not specified. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of `src_thread_layout`. * ​thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `ThreadScope.BLOCK`, where all threads in a block participate. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). `copy_dram_to_sram[src_thread_layout: Layout, dst_thread_layout: Layout = src_thread_layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = src_thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bound: Int)` Efficiently copy data from global memory (DRAM) to shared memory (SRAM) on AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's `buffer_load` intrinsic to efficiently transfer data while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. **Parameters:** * ​src\_thread\_layout (`Layout`): The layout used to distribute the source tensor across threads. This determines how the workload is divided among participating threads. * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. Defaults to the same layout as `src_thread_layout`. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling pattern to apply when distributing the destination tensor. This can improve memory access patterns and reduce bank conflicts. Defaults to None (no swizzling). * ​num\_threads (`Int`): The total number of threads participating in the copy operation. Defaults to the size of `src_thread_layout`. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. 
**Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory (SRAM). * ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator in global memory (DRAM) to be copied. * ​bound (`Int`): The bound of the source tensor iterator. `copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_iter: LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], bound: Int)` Synchronously copy data from DRAM to SRAM using a unified thread layout for AMD GPUs. This is a convenience wrapper around the more general `copy_dram_to_sram()` function that uses the same layout for both source and destination tensors. It's specifically designed for AMD GPUs where the buffer\_load intrinsic requires the original base tensor. Performance: * Simplifies API usage when the same thread layout is appropriate for both source and destination tensors. * Optimized for AMD GPUs using buffer\_load intrinsics for efficient memory transfers. * Distributes the copy workload across multiple threads for parallel execution. Notes: * This function is only supported on AMD GPUs. * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. **Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of thread\_layout. * ​thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `BLOCK`, where all threads in a block participate. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src\_iter (`LayoutTensorIter[type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`): The source tensor iterator, which must be in global or generic memory (DRAM). * ​bound (`Int`): The bound of the source tensor iterator. 
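The following is a minimal usage sketch for this AMD iterator-based overload, in the same style as the example for the general overload above. The tensor names, the iterator, and the `bound` value are illustrative assumptions, not a definitive recipe:

```mojo
from layout import LayoutTensor, Layout

# Assumed setup: `src_iter` is a LayoutTensorIter over a tensor in GLOBAL
# memory, and `shared_data` is a LayoutTensor in SHARED memory (SRAM).
# `bound` is the element count of the underlying global buffer, which the
# AMD buffer_load path uses for bounds checking.
copy_dram_to_sram[Layout((8, 8))](shared_data, src_iter, bound)
```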
`copy_dram_to_sram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Synchronously copy data from DRAM to SRAM using a unified thread layout.

This is a convenience wrapper around the more general `copy_dram_to_sram()` function that uses the same layout for both source and destination tensors. It simplifies the API for the common case where the same thread distribution pattern works well for both tensors.

Example:

```mojo
from layout import LayoutTensor, Layout

var global_data = LayoutTensor[DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL]()
var shared_data = LayoutTensor[DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED]()

# Copy data using a 2D thread layout with 8x8 threads
copy_dram_to_sram[Layout((8, 8))](shared_data, global_data)
```

Performance:

* Simplifies API usage when the same thread layout is appropriate for both source and destination tensors.
* Distributes the copy workload across multiple threads for parallel execution.
* Supports vectorized loads and stores for better memory throughput.
* Can use swizzling to optimize memory access patterns and reduce bank conflicts.

Notes:

* The source tensor must be in `GENERIC` or `GLOBAL` address space (DRAM).
* The destination tensor must be in `SHARED` address space (SRAM).
* Both tensors must have the same data type.
* This function is synchronous, meaning all threads must complete their copy operations before proceeding.

**Parameters:**

* ​thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads.
* ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns and reduce bank conflicts.
* ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of `thread_layout`.
* ​thread\_scope (`ThreadScope`): Scope at which thread operations are performed (`BLOCK` or `WARP`). Defaults to `ThreadScope.BLOCK`, where all threads in a block participate.

**Args:**

* ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM).
* ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM).
--- ## copy_dram_to_sram_async `copy_dram_to_sram_async[src_thread_layout: Layout, dst_thread_layout: Layout, swizzle: Bool = False, fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0), num_threads: Int = src_thread_layout.size()](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context. This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) in a GPU context, using NVIDIA's cp.async hardware mechanism. It distributes the workload across multiple threads and allows computation to overlap with memory transfers for improved performance. Example: ```mojo from layout import LayoutTensor, Layout var global_data = LayoutTensor[DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL]() var shared_data = LayoutTensor[DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED]() # Asynchronously copy data using thread layouts copy_dram_to_sram_async[Layout((8, 8)), Layout((8, 8))](shared_data, global_data) # Perform other computations while the copy is in progress # Wait for the asynchronous copy to complete async_copy_wait_all() ``` Performance: * Performs asynchronous transfers, allowing computation to overlap with memory operations. * Distributes the copy workload across multiple threads for parallel execution. * Can use swizzling to optimize memory access patterns and reduce bank conflicts. * Supports different cache eviction policies to optimize memory hierarchy usage. * For masked tensors, performs bounds checking to handle edge cases correctly. Notes: * This function requires NVIDIA GPUs with `cp.async` support (compute capability 8.0+). * The source tensor must be in GENERIC or GLOBAL address space (DRAM). * The destination tensor must be in SHARED address space (SRAM). * Both tensors must have the same data type. * This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data. * The maximum size of each element that can be copied is 16 bytes. **Constraints:** * Requires NVIDIA GPUs with cp.async support (compute capability 8.0+). * Source tensor must be in `GENERIC` or `GLOBAL` address space. * Destination tensor must be in `SHARED` address space. * Both tensors must have the same data type. * Element size must be 4, 8, or 16 bytes. **Parameters:** * ​src\_thread\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. * ​dst\_thread\_layout (`Layout`): Layout defining how threads are organized for the destination tensor. * ​swizzle (`Bool`): Whether to apply swizzling to the destination indices to reduce bank conflicts. Defaults to False. * ​fill (`Fill`): Fill policy for handling out-of-bounds accesses. Options include: * `Fill.NONE`: No special handling (default). * `Fill.ZERO`: Fill out-of-bounds elements with zeros. 
* ​eviction\_policy (`CacheEviction`): Cache eviction policy for the source data. Options include: * `CacheEviction.EVICT_NORMAL`: Normal eviction (default). * `CacheEviction.EVICT_FIRST`: Evict data after first use. * `CacheEviction.EVICT_LAST`: Keep data in cache until last use. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of src\_thread\_layout. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM). `copy_dram_to_sram_async[thread_layout: Layout, swizzle: Bool = False, masked: Bool = False, fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0), num_threads: Int = thread_layout.size()](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Asynchronous copy from DRAM to SRAM with thread affinity mapping. This function performs an asynchronous memory transfer from DRAM (global memory) to SRAM (shared memory) using the specified thread layout for distribution. Notes: This is a convenience wrapper around the more general `copy_dram_to_sram_async()` function, using the same thread layout for both source and destination. **Parameters:** * ​thread\_layout (`Layout`): The layout used to distribute work across threads. * ​swizzle (`Bool`): Whether to apply memory access swizzling for better performance. * ​masked (`Bool`): Whether the copy operation should use masking. * ​fill (`Fill`): Fill policy for uninitialized memory regions. * ​eviction\_policy (`CacheEviction`): Cache eviction policy to use during the transfer. * ​num\_threads (`Int`): Number of threads to use for the operation, defaults to the size of `thread_layout`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination tensor in SRAM. * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source tensor in DRAM. 
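For the single-layout wrapper, usage mirrors the example shown for the two-layout overload above; a minimal sketch (the tensor shapes and the 8x8 thread layout are illustrative):

```mojo
from layout import LayoutTensor, Layout

var global_data = LayoutTensor[DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL]()
var shared_data = LayoutTensor[DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED]()

# The same 8x8 thread layout is used for both source and destination
copy_dram_to_sram_async[Layout((8, 8))](shared_data, global_data)

# Perform other computations while the copy is in progress, then wait
async_copy_wait_all()
```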
--- ## copy_local_to_dram `copy_local_to_dram[dst_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from registers (LOCAL) to global memory (DRAM). This function implements a high-performance memory transfer operation from register memory to global memory. It distributes the copy operation across multiple threads for maximum throughput while handling bounds checking for safety. **Constraints:** * The source tensor must be in LOCAL address space (registers). * The destination tensor must be in GENERIC or GLOBAL address space (DRAM). * Both tensors must have compatible data types. **Parameters:** * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in global memory (DRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in register memory (LOCAL) to be copied. `copy_local_to_dram[dst_thread_layout: Layout, thread_scope: ThreadScope = ThreadScope(0)](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], dst_base: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Efficiently copy data from registers (LOCAL) to global memory (DRAM) on AMD GPUs. This function implements an optimized memory transfer operation specifically for AMD GPU architectures. It utilizes the hardware's buffer\_store intrinsic to efficiently transfer data from registers to global memory while handling bounds checking. The function distributes the copy operation across multiple threads for maximum throughput. Notes: * This function is particularly useful for writing computed results from registers back to global memory with minimal latency. * The offset calculation is optimized for performance rather than flexibility. **Constraints:** * Only supported on AMD GPUs. * Destination tensor must be in GLOBAL address space. * Source tensor must be in LOCAL address space. 
* Data types must match between source and destination tensors. **Parameters:** * ​dst\_thread\_layout (`Layout`): The layout used to distribute the destination tensor across threads. This determines how the workload is divided among participating threads. * ​thread\_scope (`ThreadScope`): Defines whether operations are performed at `BLOCK` or `WARP` level. `BLOCK` scope involves all threads in a thread block, while `WARP` scope restricts operations to threads within the same warp. Defaults to `ThreadScope.BLOCK`. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in global memory (DRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in register memory (LOCAL address space) to be copied. * ​dst\_base (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The original global memory tensor from which dst is derived. This is used to construct the buffer descriptor required by AMD's `buffer_store` intrinsic. --- ## copy_local_to_local `copy_local_to_local(dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data between local memory (register) tensors with type conversion. This function performs a synchronous copy operation between register tensors in a GPU context, with support for converting from float32 to half-precision formats (bfloat16/float16). It's particularly optimized for specific tensor layouts commonly used in matrix multiplication operations. Example: ```mojo from layout import LayoutTensor, Layout from layout.layout_tensor import copy_local_to_local var src_reg = LayoutTensor[DType.float32, Layout((16, 8)), address_space=AddressSpace.LOCAL]() var dst_reg = LayoutTensor[DType.bfloat16, Layout((16, 8)), address_space=AddressSpace.LOCAL]() # Process data in float32 registers # ... # Convert and copy to bfloat16 registers copy_local_to_local(dst_reg, src_reg) ``` Performance: * Optimized for specific 2D tensor layouts with contiguous inner dimensions. * Special fast path for 2D tensors with specific layouts used in matrix multiplication. * For MMA (Matrix Multiply-Accumulate) operations, efficiently handles the conversion between output fragments and input fragments with different layouts. * Falls back to element-wise copy for general cases. Notes: * Both source and destination tensors must be in `LOCAL` address space (registers). * This function currently only supports copying from float32 to half-precision formats. * For 2D tensors with stride\[1] == 1, a specialized fast path is used that's optimized for matrix multiplication patterns. * This function is particularly useful in GPU kernels for converting between different precision formats while keeping data in registers. 
**Constraints:** * Destination tensor must be in `LOCAL` address space. * Source tensor must be in `LOCAL` address space. * Destination tensor must have a half-precision floating-point data type. * Source tensor must have float32 data type. * Both tensors must have the same total size. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in local memory (registers) and have a half-precision floating-point data type (bfloat16 or float16). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in local memory (registers) and have float32 data type. --- ## copy_sram_to_dram `copy_sram_to_dram[thread_layout: Layout, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), num_threads: Int = thread_layout.size(), binary_op: OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]]({:i1 0, 1})](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from SRAM (shared memory) to DRAM (global memory). This function performs a synchronous memory transfer from SRAM (shared memory) to DRAM (global memory) using the specified thread layout for workload distribution. It supports optional swizzling for optimized memory access patterns and binary operations for combining data during the transfer. Example: ```mojo from layout import LayoutTensor, Layout var shared_data = LayoutTensor[DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED]() var global_data = LayoutTensor[DType.float32, Layout((128, 128)), address_space=AddressSpace.GLOBAL]() # Copy data using a 2D thread layout with 8x8 threads copy_sram_to_dram[Layout((8, 8))](global_data, shared_data) ``` Performance: * Distributes the copy workload across multiple threads for parallel execution. * Supports vectorized loads and stores for better memory throughput. * Can use swizzling to optimize memory access patterns. * Supports binary operations to combine data during transfer (e.g., for reduction operations). Notes: * The source tensor must be in `SHARED` address space (SRAM). * The destination tensor must be in `GENERIC` or `GLOBAL` address space (DRAM). * Supports FP32 to half-precision downcast during copy if needed. * Handles masked tensors with proper bounds checking. * This function is synchronous, meaning all threads must complete their copy operations before proceeding. **Constraints:** * Source tensor must be in SHARED address space with a static layout. * Destination tensor must be in GENERIC or GLOBAL address space. * For type conversion, only FP32 to half-precision is supported. * For vectorized copy with type conversion, both tensors must have element layouts matching the SIMD width of the destination type. 
**Parameters:** * ​thread\_layout (`Layout`): Layout defining how threads are organized for both source and destination. This determines how the workload is distributed among threads. * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the source indices, which can improve memory access patterns and reduce bank conflicts. * ​num\_threads (`Int`): Total number of threads participating in the copy operation. Defaults to the size of thread\_layout. * ​binary\_op (`OptionalReg[fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]]`): Optional binary operation to apply during the copy, combining source data with existing destination data. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in global or generic memory (DRAM). * ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in shared memory (SRAM). --- ## copy_sram_to_local `copy_sram_to_local[src_warp_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Synchronously copy data from SRAM (shared memory) to local memory. This function performs a synchronous memory transfer from SRAM (shared memory) to local memory (registers) using the specified thread layout for workload distribution. Example: ```mojo from layout import LayoutTensor, Layout var shared_data = LayoutTensor[DType.float32, Layout((32, 32)), address_space=AddressSpace.SHARED]() var local_data = LayoutTensor[DType.float32, Layout((4, 4)), address_space=AddressSpace.LOCAL]() # Copy data using a thread layout with 8 threads copy_sram_to_local[Layout(8)](local_data, shared_data) ``` Performance: * Distributes the copy workload across multiple threads for parallel execution. * Optimized for transferring data from shared memory to registers. * Supports optional axis-specific distribution for specialized access patterns. **Constraints:** * The source tensor must be in SHARED address space (SRAM). * The destination tensor must be in LOCAL address space (registers). * Both tensors must have the same data type. **Parameters:** * ​src\_warp\_layout (`Layout`): Layout defining how threads are organized for the source tensor. This determines how the workload is distributed among threads. * ​axis (`OptionalReg[Int]`): Optional parameter specifying which axis to distribute along. When provided, distribution happens along the specified axis. When None (default), distribution uses the standard layout pattern. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in local memory (registers). 
* ​src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in shared memory (SRAM).

---

## copy_to_slice

`copy_to_slice[type: DType, start_type: DType, end_type: DType, step_type: DType, in_rank: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](buffer: NDBuffer[type, in_rank, origin], in_slice: NDBuffer[type, in_rank, origin], start: NDBuffer[start_type, 1, origin], end: NDBuffer[end_type, 1, origin], step: NDBuffer[step_type, 1, origin], context: DeviceContextPtr = DeviceContextPtr())`

---

## Copyable

The Copyable trait denotes a type whose value can be copied.

Example implementing the `Copyable` trait on `Foo`, which requires the `__copyinit__` method:

```mojo
struct Foo(Copyable):
    var s: String

    @implicit
    fn __init__(out self, s: String):
        self.s = s

    fn __copyinit__(out self, other: Self):
        print("copying value")
        self.s = other.s
```

You can now copy objects inside a generic function:

```mojo
fn copy_return[T: Copyable](foo: T) -> T:
    var copy = foo
    return copy

var foo = Foo("test")
var res = copy_return(foo)
```

```plaintext
copying value
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__copyinit__`

`__copyinit__(out self: _Self, existing: _Self, /)`

Create a new instance of the value by copying an existing one.

**Args:**

* ​existing (`_Self`): The value to copy.

---

## copysign

`copysign[dtype: DType, width: Int, //](magnitude: SIMD[dtype, width], sign: SIMD[dtype, width]) -> SIMD[dtype, width]`

Returns a value with the magnitude of the first operand and the sign of the second operand.

**Constraints:** The type of the input must be numeric.

**Parameters:**

* ​dtype (`DType`): The `dtype` of the input and output SIMD vector.
* ​width (`Int`): The width of the input and output SIMD vector.

**Args:**

* ​magnitude (`SIMD[dtype, width]`): The magnitude to use.
* ​sign (`SIMD[dtype, width]`): The sign to copy.

**Returns:** A value with the magnitude of `magnitude` and the sign of `sign`.

---

## coroutine

Implements classes and methods for coroutines. These are Mojo built-ins, so you don't need to import them.

## Aliases

### `AnyCoroutine`

`alias AnyCoroutine = !co.routine`

## Structs

* [​`Coroutine`](/mojo/stdlib/builtin/coroutine/Coroutine): Represents a coroutine.
* [​`RaisingCoroutine`](/mojo/stdlib/builtin/coroutine/RaisingCoroutine): Represents a coroutine that can raise.

---

## Coroutine

`@register_passable`

`struct Coroutine[type: AnyType, origins: origin.set]`

Represents a coroutine.

Coroutines can pause execution, saving the state of the program (including values of local variables and the location of the next instruction to be executed). When the coroutine is resumed, execution continues from where it left off, with the saved state restored.

## Parameters

* ​type (`AnyType`): Type of value returned upon completion of the coroutine.
* ​origins (`origin.set`): The origin of the coroutine's captures.

## Implemented traits

`UnknownDestructibility`

## Methods

### `__init__`

`@implicit`

`__init__(handle: !co.routine) -> Self`

Construct a coroutine object from a handle.

**Args:**

* ​handle (`!co.routine`): The coroutine handle to wrap.

### `__await__`

`__await__(owned self, out result: type)`

Suspends the current function until this coroutine completes.

**Returns:** The result produced by the coroutine.

### `force_destroy`

`force_destroy(owned self)`

Destroy the coroutine object.
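A rough usage sketch (hedged: `async fn` support and its exact spelling may vary across Mojo versions): calling an `async fn` produces a `Coroutine`, and `await` drives it to completion via `__await__`:

```mojo
async fn add_one(x: Int) -> Int:
    return x + 1

async fn compute() -> Int:
    # Calling an async fn returns a Coroutine; `await` suspends `compute`
    # until `add_one` finishes, then yields its Int result.
    return await add_one(41)
```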
--- ## cos `cos[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `cos` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `cos` of the input. --- ## cosh `cosh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `cosh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `cosh` of the input. --- ## cosize `cosize(l: Layout) -> Int` Returns the size of the memory region spanned by the layout. This is a standalone function equivalent to the Layout.cosize() method. **Args:** * ​l (`Layout`): The layout to calculate the cosize for. **Returns:** The size of the memory region required by the layout. --- ## count_leading_zeros `count_leading_zeros(val: Int) -> Int` Counts the number of leading zeros of an integer. **Args:** * ​val (`Int`): The input value. **Returns:** The number of leading zeros of the input. `count_leading_zeros[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Counts the per-element number of leading zeros in a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `DType` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` contains the number of leading zeros at position `i` of the input value. --- ## count_trailing_zeros `count_trailing_zeros(val: Int) -> Int` Counts the number of trailing zeros for an integer. **Args:** * ​val (`Int`): The input value. **Returns:** The number of trailing zeros of the input. `count_trailing_zeros[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Counts the per-element number of trailing zeros in a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` contains the number of trailing zeros at position `i` of the input value. --- ## counter Defines the `Counter` type. You can import these APIs from the `collections` package. For example: ```mojo from collections import Counter ``` ## Structs * [​`Counter`](/mojo/stdlib/collections/counter/Counter): A container for counting hashable items. * [​`CountTuple`](/mojo/stdlib/collections/counter/CountTuple): A tuple representing a value and its count in a Counter. --- ## Counter `struct Counter[V: KeyElement]` A container for counting hashable items. The value type must be specified statically, unlike a Python Counter, which can accept arbitrary value types. The value type must implement the `KeyElement` trait, as its values are stored in the dictionary as keys. 
Usage:

```mojo
from collections import Counter

var c = Counter[String]("a", "a", "a", "b", "b", "c", "d", "c", "c")
print(c["a"]) # prints 3
print(c["b"]) # prints 2
```

## Parameters

* ​V (`KeyElement`): The value type to be counted. Currently must be `KeyElement`.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self)`

Create a new, empty Counter object.

`__init__(out self, owned *values: V)`

Create a new Counter from a list of values.

Usage:

```mojo
from collections import Counter

var c = Counter[String]("a", "a", "a", "b", "b", "c", "d", "c", "c")
print(c["a"]) # prints 3
print(c["b"]) # prints 2
```

**Args:**

* ​\*values (`V`): A list of values to count.

`@implicit`

`__init__(out self, items: List[V, hint_trivial_type])`

Create a Counter from an input iterable.

Usage:

```mojo
from collections import Counter

var c = Counter[String](["a", "a", "a", "b", "b", "c", "d", "c", "c"])
print(c["a"]) # prints 3
print(c["b"]) # prints 2
```

**Args:**

* ​items (`List[V, hint_trivial_type]`): A list of items to count.

### `__bool__`

`__bool__(self) -> Bool`

Check if the Counter is empty or not.

**Returns:** `False` if the Counter is empty, `True` otherwise.

### `__getitem__`

`__getitem__(self, key: V) -> Int`

Get the count of a key.

**Args:**

* ​key (`V`): The key to get the count of.

**Returns:** The count of the key.

### `__setitem__`

`__setitem__(mut self, value: V, count: Int)`

Set the count of a value in the Counter.

**Args:**

* ​value (`V`): The value to associate with the specified count.
* ​count (`Int`): The count to store in the Counter.

### `__neg__`

`__neg__(self) -> Self`

Subtract from an empty Counter. Strips positive and zero counts, and flips the sign on negative counts.

**Returns:** A new Counter containing only the formerly negative counts, with their signs flipped to positive.

### `__pos__`

`__pos__(self) -> Self`

Return a shallow copy of the Counter, stripping non-positive counts.

**Returns:** A shallow copy of the Counter.

### `__lt__`

`__lt__(self, other: Self) -> Bool`

Check if all counts are less than those in the other Counter.

**Args:**

* ​other (`Self`): The other Counter to compare to.

**Returns:** True if all counts are less than those in the other Counter, False otherwise.

### `__le__`

`__le__(self, other: Self) -> Bool`

Check if all counts are less than or equal to those in the other Counter.

**Args:**

* ​other (`Self`): The other Counter to compare to.

**Returns:** True if all counts are less than or equal to those in the other Counter, False otherwise.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Check if all counts agree. Missing counts are treated as zero.

**Args:**

* ​other (`Self`): The other Counter to compare to.

**Returns:** True if the two Counters are equal, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Check if any counts disagree. Missing counts are treated as zero.

**Args:**

* ​other (`Self`): The other Counter to compare to.

**Returns:** True if the two Counters are not equal, False otherwise.

### `__gt__`

`__gt__(self, other: Self) -> Bool`

Check if all counts are greater than those in the other Counter.

**Args:**

* ​other (`Self`): The other Counter to compare to.

**Returns:** True if all counts are greater than those in the other Counter, False otherwise.

### `__ge__`

`__ge__(self, other: Self) -> Bool`

Check if all counts are greater than or equal to those in the other Counter.

**Args:**

* ​other (`Self`): The other Counter to compare to.

**Returns:** True if all counts are greater than or equal to those in the other Counter, False otherwise.
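The four comparison operators above implement element-wise count comparison (multiset inclusion). A small illustrative example using only the constructors and operators documented here:

```mojo
from collections import Counter

var small = Counter[String]("a", "a", "b")
var big = Counter[String]("a", "a", "a", "b", "b")

print(small <= big)  # True: every count in `small` is <= the matching count in `big`
print(small < big)   # True: every count in `small` is strictly less
```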
### `__contains__`

`__contains__(self, key: V) -> Bool`

Check if a given key is in the Counter or not.

**Args:**

* ​key (`V`): The key to check.

**Returns:** True if the key exists in the Counter, False otherwise.

### `__add__`

`__add__(self, other: Self) -> Self`

Add counts from two Counters.

**Args:**

* ​other (`Self`): The other Counter to add to this Counter.

**Returns:** A new Counter with the counts from both Counters added together.

### `__sub__`

`__sub__(self, other: Self) -> Self`

Subtract counts, but keep only results with positive counts.

**Args:**

* ​other (`Self`): The other Counter to subtract from this Counter.

**Returns:** A new Counter with the counts from the other Counter subtracted from this Counter.

### `__and__`

`__and__(self, other: Self) -> Self`

Intersection: keep common elements with the minimum count.

**Args:**

* ​other (`Self`): The other Counter to intersect with.

**Returns:** A new Counter with the common elements and the minimum count of the two Counters.

### `__or__`

`__or__(self, other: Self) -> Self`

Union: keep all elements with the maximum count.

**Args:**

* ​other (`Self`): The other Counter to union with.

**Returns:** A new Counter with all elements and the maximum count of the two Counters.

### `__iadd__`

`__iadd__(mut self, other: Self)`

Add counts from another Counter to this Counter.

**Args:**

* ​other (`Self`): The other Counter to add to this Counter.

### `__isub__`

`__isub__(mut self, other: Self)`

Subtract counts from another Counter from this Counter.

**Args:**

* ​other (`Self`): The other Counter to subtract from this Counter.

### `__iand__`

`__iand__(mut self, other: Self)`

Intersection: keep common elements with the minimum count.

**Args:**

* ​other (`Self`): The other Counter to intersect with.

### `__ior__`

`__ior__(mut self, other: Self)`

Union: keep all elements with the maximum count.

**Args:**

* ​other (`Self`): The other Counter to union with.

### `copy`

`copy(self) -> Self`

Create a new Counter by copying this Counter.

**Returns:** A copy of the value.

### `fromkeys`

`static fromkeys(keys: List[V, hint_trivial_type], value: Int) -> Self`

Create a new Counter from a list of keys and a default value.

**Args:**

* ​keys (`List[V, hint_trivial_type]`): The keys to create the Counter from.
* ​value (`Int`): The default value to associate with each key.

**Returns:** A new Counter with the keys and default value.

### `__iter__`

`__iter__(self) -> _DictKeyIter[V, Int, self._data]`

Iterate over the Counter's keys as immutable references.

**Returns:** An iterator of immutable references to the Counter keys.

### `__len__`

`__len__(self) -> Int`

Returns the number of elements currently stored in the Counter.

**Returns:** The number of elements in the Counter.

### `get`

`get(self, value: V) -> Optional[Int]`

Get the count of a value from the Counter.

**Args:**

* ​value (`V`): The value to search for in the Counter.

**Returns:** An `Optional` containing the count if the value was present, otherwise an empty `Optional`.

`get(self, value: V, default: Int) -> Int`

Get the count of a value from the Counter.

**Args:**

* ​value (`V`): The value to search for in the counter.
* ​default (`Int`): Default count to return.

**Returns:** The count if the value was present, otherwise the default.

### `pop`

`pop(mut self, value: V) -> Int`

Remove a value from the Counter by value.

**Args:**

* ​value (`V`): The value to remove from the Counter.

**Returns:** The count associated with the value, if it was in the Counter.
**Raises:** "KeyError" if the value was not present in the Counter.

`pop(mut self, value: V, owned default: Int) -> Int`

Remove a value from the Counter by value.

**Args:**

* ​value (`V`): The value to remove from the Counter.
* ​default (`Int`): Optionally provide a default value to return if the value was not found instead of raising.

**Returns:** The count associated with the value, if it was in the Counter. If it wasn't, return the provided default value instead.

**Raises:** "KeyError" if the value was not present in the Counter and no default value was provided.

### `keys`

`keys(ref self) -> _DictKeyIter[V, Int, self_is_origin._data]`

Iterate over the Counter's keys as immutable references.

**Returns:** An iterator of immutable references to the Counter keys.

### `values`

`values(ref self) -> _DictValueIter[V, Int, self_is_origin._data]`

Iterate over the Counter's values as references.

**Returns:** An iterator of references to the Counter values.

### `items`

`items(self) -> _DictEntryIter[V, Int, self._data]`

Iterate over the Counter's entries as immutable references.

**Returns:** An iterator of immutable references to the Counter entries.

### `clear`

`clear(mut self)`

Remove all elements from the Counter.

### `popitem`

`popitem(mut self) -> CountTuple[V]`

Remove and return an arbitrary (key, value) pair from the Counter.

**Returns:** A CountTuple containing the key and value of the removed item.

**Raises:** "KeyError" if the Counter is empty.

### `total`

`total(self) -> UInt`

Return the total of all counts in the Counter.

**Returns:** The total of all counts in the Counter.

### `most_common`

`most_common(self, n: UInt) -> List[CountTuple[V]]`

Return a list of the `n` most common elements and their counts, from the most common to the least.

**Args:**

* ​n (`UInt`): The number of most common elements to return.

**Returns:** A list of the n most common elements and their counts.

### `elements`

`elements(self) -> List[V]`

Return a list of elements, repeating each as many times as its count.

**Returns:** A list of the elements in the Counter, with each element repeated according to its count.

### `update`

`update(mut self, other: Self)`

Update the Counter, like `dict.update()`, but adds counts instead of replacing them.

**Args:**

* ​other (`Self`): The Counter to update this Counter with.

### `subtract`

`subtract(mut self, other: Self)`

Subtract counts. Both inputs and outputs may be zero or negative.

**Args:**

* ​other (`Self`): The Counter to subtract from this Counter.

---

## CountTuple

`struct CountTuple[V: KeyElement]`

A tuple representing a value and its count in a Counter.

## Parameters

* ​V (`KeyElement`): The value in the Counter.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, value: V, count: UInt)`

Create a new CountTuple.

**Args:**

* ​value (`V`): The value in the Counter.
* ​count (`UInt`): The count of the value in the Counter.

### `__getitem__`

`__getitem__(self, idx: Int) -> Variant[V, Int]`

Get an element in the tuple.

**Args:**

* ​idx (`Int`): The element to return.

**Returns:** The value if idx is 0 and the count if idx is 1.

### `__lt__`

`__lt__(self, other: Self) -> Bool`

Compare two CountTuples by count, then by value.

**Args:**

* ​other (`Self`): The other CountTuple to compare to.

**Returns:** True if this CountTuple is less than the other, False otherwise.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Compare two CountTuples for equality.

**Args:**

* ​other (`Self`): The other CountTuple to compare to.
**Returns:** True if the two CountTuples are equal, False otherwise. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. --- ## cp_async_bulk_commit_group `cp_async_bulk_commit_group()` Commits all prior initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group. This function commits all previously initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group. The cp.async.bulk instructions are used for asynchronous bulk memory transfers on NVIDIA GPUs. The function creates a synchronization point for bulk memory transfers, allowing better control over memory movement and synchronization between different stages of computation. Note: This functionality is only available on NVIDIA GPUs. Attempting to use this function on non-NVIDIA GPUs will result in a compile time error. --- ## cp_async_bulk_tensor_global_shared_cta `cp_async_bulk_tensor_global_shared_cta[src_type: AnyType, rank: Int, /, eviction_policy: CacheEviction = CacheEviction(0)](src_mem: UnsafePointer[src_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], coords: IndexList[rank])` Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. This function provides an efficient way to write data back from shared memory to global memory using TMA. It supports both rank-1 and rank-2 tensors and allows control over cache eviction policy. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The source memory must be properly aligned for TMA operations. * The TMA descriptor must be properly initialized before use. **Parameters:** * ​src\_type (`AnyType`): The data type of the source tensor elements. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). * ​eviction\_policy (`CacheEviction`): Optional cache eviction policy that controls how the data is handled in the cache hierarchy. Defaults to EVICT\_NORMAL. **Args:** * ​src\_mem (`UnsafePointer[src_type, address_space=AddressSpace(3)]`): Pointer to the source data in shared memory that will be copied to global memory. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. --- ## cp_async_bulk_tensor_reduce `cp_async_bulk_tensor_reduce[src_type: AnyType, rank: Int, /, *, reduction_kind: ReduceOp, eviction_policy: CacheEviction = CacheEviction(0)](src_mem: UnsafePointer[src_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], coords: IndexList[rank])` Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. This function performs an in-place reduction operation, combining data from shared memory with data in global memory using the specified reduction operation. The operation is performed asynchronously and uses TMA's tile mode for efficient memory access. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure completion. 
* Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The source memory must be properly aligned for TMA operations. * The TMA descriptor must be properly initialized before use. * The reduction operation is performed atomically to ensure correctness. **Parameters:** * ​src\_type (`AnyType`): The data type of the source tensor elements. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). * ​reduction\_kind (`ReduceOp`): The type of reduction operation to perform. Supported operations are: "add", "min", "max", "inc", "dec", "and", "or", "xor". * ​eviction\_policy (`CacheEviction`): Optional cache eviction policy that controls how the data is handled in the cache hierarchy. Defaults to `EVICT_NORMAL`. **Args:** * ​src\_mem (`UnsafePointer[src_type, address_space=AddressSpace(3)]`): Pointer to the source data in shared memory that will be reduced with the global memory data. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to operate on. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. --- ## cp_async_bulk_tensor_shared_cluster_global `cp_async_bulk_tensor_shared_cluster_global[dst_type: AnyType, mbr_type: AnyType, rank: Int](dst_mem: UnsafePointer[dst_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], mem_bar: UnsafePointer[mbr_type, address_space=AddressSpace(3)], coords: IndexList[rank])` Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory. This function performs an asynchronous copy of tensor data using NVIDIA's Tensor Memory Access (TMA) mechanism. It supports both rank-1 and rank-2 tensors and uses cluster-level synchronization for efficient data movement. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure copy completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The memory barrier should be properly initialized before use. **Parameters:** * ​dst\_type (`AnyType`): The data type of the destination memory. * ​mbr\_type (`AnyType`): The data type of the memory barrier. * ​rank (`Int`): The dimensionality of the tensor (1, 2, or 3). **Args:** * ​dst\_mem (`UnsafePointer[dst_type, address_space=AddressSpace(3)]`): Pointer to the destination in shared memory where the tensor data will be copied. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor that contains metadata about the tensor layout and memory access patterns. * ​mem\_bar (`UnsafePointer[mbr_type, address_space=AddressSpace(3)]`): Pointer to a shared memory barrier used for synchronizing the asynchronous copy operation across threads in the cluster. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. 
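A hedged sketch of how a kernel might issue this copy. The host-side setup (building the TMA descriptor, allocating shared memory, initializing the barrier) is assumed and omitted; `smem_dst`, `desc_ptr`, `mbar`, and the tile coordinates are illustrative names, and the `gpu.memory` import path is an assumption:

```mojo
from gpu.memory import cp_async_bulk_tensor_shared_cluster_global
from utils.index import IndexList

# Inside a kernel, with:
#   smem_dst: UnsafePointer to the shared-memory destination tile
#   desc_ptr: UnsafePointer[NoneType] to a host-initialized TMA descriptor
#   mbar:     UnsafePointer to an initialized shared-memory barrier
# copy the tile at (tile_row, tile_col) from global memory into shared memory:
cp_async_bulk_tensor_shared_cluster_global(
    smem_dst, desc_ptr, mbar, IndexList[2](tile_row, tile_col)
)
# The copy is asynchronous: wait on `mbar` before reading `smem_dst`.
```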
--- ## cp_async_bulk_tensor_shared_cluster_global_multicast `cp_async_bulk_tensor_shared_cluster_global_multicast[dst_type: AnyType, mbr_type: AnyType, rank: Int](dst_mem: UnsafePointer[dst_type, address_space=AddressSpace(3)], tma_descriptor: UnsafePointer[NoneType], mem_bar: UnsafePointer[mbr_type, address_space=AddressSpace(3)], coords: IndexList[rank], multicast_mask: SIMD[uint16, 1])` Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to shared memories of multiple CTAs in a cluster. This function performs an optimized multicast copy operation where a single global memory read can be distributed to multiple CTAs' shared memories simultaneously, reducing memory bandwidth usage. It supports both rank-1 and rank-2 tensors and uses cluster-level synchronization. Notes: * This operation is asynchronous - use appropriate memory barriers to ensure copy completion. * Only supports rank-1 and rank-2 tensors. * Requires NVIDIA GPU with TMA support. * The memory barrier should be properly initialized before use. * The multicast\_mask must be properly configured based on cluster size and desired distribution. **Parameters:** * ​dst\_type (`AnyType`): The data type of the destination tensor elements. * ​mbr\_type (`AnyType`): The data type of the memory barrier. * ​rank (`Int`): The dimensionality of the tensor (must be 1 or 2). **Args:** * ​dst\_mem (`UnsafePointer[dst_type, address_space=AddressSpace(3)]`): Pointer to the destination in shared memory where the tensor data will be copied. Must be properly aligned according to TMA requirements. * ​tma\_descriptor (`UnsafePointer[NoneType]`): Pointer to the TMA descriptor containing metadata about tensor layout and memory access patterns. * ​mem\_bar (`UnsafePointer[mbr_type, address_space=AddressSpace(3)]`): Pointer to a shared memory barrier used for synchronizing the asynchronous copy operation across threads in the cluster. * ​coords (`IndexList[rank]`): Coordinates specifying which tile of the tensor to copy. For rank-1 tensors, this is a single coordinate. For rank-2 tensors, this contains both row and column coordinates. * ​multicast\_mask (`SIMD[uint16, 1]`): A 16-bit bitmask where each bit corresponds to a CTA in the cluster. Set bits indicate which CTAs will receive a copy of the loaded data. This enables efficient data sharing across multiple CTAs. --- ## cp_async_bulk_wait_group `cp_async_bulk_wait_group[n: SIMD[int32, 1], read: Bool = True]()` Waits for completion of asynchronous bulk memory transfer groups. This function causes the executing thread to wait until a specified number of the most recent bulk async-groups are pending. It provides synchronization control for bulk memory transfers on NVIDIA GPUs. Note: This functionality is only available on NVIDIA GPUs. Attempting to use this function on non-NVIDIA GPUs will result in a compile time error. Example: ```mojo from gpu.sync import cp_async_bulk_wait_group # Wait until at most 2 async groups are pending cp_async_bulk_wait_group[2]() # Wait for all async groups to complete cp_async_bulk_wait_group[0]() ``` **Parameters:** * ​n (`SIMD[int32, 1]`): The number of most recent bulk async-groups allowed to remain pending. When n=0, waits for all prior bulk async-groups to complete. * ​read (`Bool`): If True, indicates that subsequent reads to the transferred memory are expected, enabling optimizations for read access patterns. Defaults to True. 
---

## cp_async_k_major

`cp_async_k_major[type: DType, eviction_policy: CacheEviction = CacheEviction(0)](dst: LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with K-major layout.

This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's Tensor Memory Accelerator (TMA) hardware. It optimizes for K-major memory access patterns, which is particularly beneficial for tensor operations like matrix multiplication where the inner dimension (K) is accessed contiguously.

The function automatically determines the optimal tile size and thread distribution based on the tensor shapes and hardware capabilities, leveraging TMA's efficient memory transfer mechanisms.

Example:

```mojo
from gpu.memory import AddressSpace, async_copy_wait_all
from layout import LayoutTensor, Layout
from layout.layout_tensor import cp_async_k_major

# Illustrative only: in a real kernel, these tensors come from device
# buffers rather than default constructors.
var global_data = LayoutTensor[
    DType.float32, Layout.row_major(128, 128), address_space = AddressSpace.GLOBAL
]()
var shared_data = LayoutTensor[
    DType.float32, Layout.row_major(32, 32), address_space = AddressSpace.SHARED
]()

# Copy data with K-major layout optimization
cp_async_k_major[DType.float32](shared_data, global_data)

# Wait for the asynchronous copy to complete
async_copy_wait_all()
```

Performance:

* Uses TMA hardware acceleration for optimal memory transfer performance.
* Optimizes for K-major access patterns, which can significantly improve performance for certain tensor operations like matrix multiplication.
* Performs asynchronous transfers, allowing computation to overlap with memory operations.
* Automatically determines optimal tile sizes based on tensor dimensions.
* Uses hardware-accelerated swizzling to reduce shared memory bank conflicts.

Notes:

* This function requires NVIDIA GPUs with TMA support (compute capability 9.0+).
* The source tensor must be in the `GENERIC` or `GLOBAL` address space (DRAM).
* The destination tensor must be in the `SHARED` address space (SRAM).
* Both tensors must have the same data type.
* This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data.
* K-major layout is particularly beneficial for matrix multiplication operations where the inner dimension (K) is accessed contiguously.

**Constraints:**

* Requires NVIDIA GPUs with TMA support (compute capability 9.0+).
* Source tensor must be in the `GENERIC` or `GLOBAL` address space.
* Destination tensor must be in the `SHARED` address space.
* Both tensors must have the same data type.
* Source and destination tensors must be 2D.

**Parameters:**

* type (`DType`): The data type of the tensor elements.
* eviction_policy (`CacheEviction`): The cache eviction policy to use. Defaults to `CacheEviction.EVICT_NORMAL`.
**Args:**

* dst (`LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM).
* src (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM).

---

## cp_async_mn_major

`cp_async_mn_major[type: DType, eviction_policy: CacheEviction = CacheEviction(0)](dst: LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with MN-major layout.

This function performs an asynchronous copy operation from global memory (DRAM) to shared memory (SRAM) using NVIDIA's Tensor Memory Accelerator (TMA) hardware. It optimizes for MN-major memory access patterns, which is particularly beneficial for tensor operations where the outer dimensions (M, N) are accessed contiguously.

The function automatically determines the optimal tile size and thread distribution based on the tensor shapes and hardware capabilities, leveraging TMA's efficient memory transfer mechanisms.

Example:

```mojo
from gpu.memory import AddressSpace, async_copy_wait_all
from layout import LayoutTensor, Layout
from layout.layout_tensor import cp_async_mn_major

# Illustrative only: in a real kernel, these tensors come from device
# buffers rather than default constructors.
var global_data = LayoutTensor[
    DType.float32, Layout.row_major(128, 128), address_space = AddressSpace.GLOBAL
]()
var shared_data = LayoutTensor[
    DType.float32, Layout.row_major(32, 32), address_space = AddressSpace.SHARED
]()

# Copy data with MN-major layout optimization
cp_async_mn_major[DType.float32](shared_data, global_data)

# Wait for the asynchronous copy to complete
async_copy_wait_all()
```

Performance:

* Uses TMA hardware acceleration for optimal memory transfer performance.
* Optimizes for MN-major access patterns, which can significantly improve performance for certain tensor operations where outer dimensions are accessed contiguously.
* Performs asynchronous transfers, allowing computation to overlap with memory operations.
* Automatically determines optimal tile sizes based on tensor dimensions.
* Uses hardware-accelerated swizzling to reduce shared memory bank conflicts.

Notes:

* This function requires NVIDIA GPUs with TMA support (compute capability 9.0+).
* The source tensor must be in the `GENERIC` or `GLOBAL` address space (DRAM).
* The destination tensor must be in the `SHARED` address space (SRAM).
* Both tensors must have the same data type.
* This function is asynchronous, so you must call [`async_copy_wait_all()`](/mojo/stdlib/gpu/memory/async_copy_wait_all/) or [`async_copy_wait_group()`](/mojo/stdlib/gpu/memory/async_copy_wait_group/) to ensure the copy has completed before using the data.
* MN-major layout is particularly beneficial for operations where the outer dimensions are accessed contiguously, such as certain convolution operations.

**Constraints:**

* Requires NVIDIA GPUs with TMA support (compute capability 9.0+).
* Source tensor must be in the `GENERIC` or `GLOBAL` address space.
* Destination tensor must be in the `SHARED` address space.
* Both tensors must have the same data type.
* Source and destination tensors must be 2D.

**Parameters:**

* type (`DType`): The data type of the tensor elements.
* eviction_policy (`CacheEviction`): The cache eviction policy to use. Defaults to `CacheEviction.EVICT_NORMAL`.

**Args:**

* dst (`LayoutTensor[type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor, which must be in shared memory (SRAM).
* src (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor, which must be in global or generic memory (DRAM).

---

## crd2idx

`crd2idx(crd: IntTuple[origin], shape: IntTuple[origin]) -> Int`

Map a logical coordinate to a linear index.

This function converts a multi-dimensional coordinate to a linear index based on the shape. It uses default strides computed from the shape.

**Args:**

* crd (`IntTuple[origin]`): The coordinate tuple to convert.
* shape (`IntTuple[origin]`): The shape of the tensor/array.

**Returns:**

The linear index corresponding to the coordinate.

`crd2idx(crd: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> Int`

Map a logical coordinate to a linear index with custom strides.

This function converts a multi-dimensional coordinate to a linear index based on the shape and stride information. If no stride is provided, it computes default strides from the shape. For example, with shape `(3, 4)` and strides `(4, 1)`, the coordinate `(1, 2)` maps to linear index `1 * 4 + 2 = 6`.

The function handles various input combinations:

* Tuple coordinates with tuple shapes and strides
* Single integer coordinate with tuple shapes and strides
* Single integer coordinate with single integer shape and stride

Aborts:

* If coordinate and shape dimensions don't match.
* If shape and stride dimensions don't match.
* If input type combinations are invalid.

**Args:**

* crd (`IntTuple[origin]`): The coordinate(s) to convert; can be a single value or a tuple of coordinates.
* shape (`IntTuple[origin]`): The shape of the tensor/array; can be a single value or a tuple of dimensions.
* \_stride (`IntTuple[origin]`): Optional custom strides; defaults to row-major strides if not provided.

**Returns:**

The linear index corresponding to the coordinate.

---

## crd2idx

`crd2idx[: ImmutableOrigin, : ImmutableOrigin, : ImmutableOrigin, //, crd_t: IntTuple[$2], shape_t: IntTuple[$1], stride_t: IntTuple[$0], out_type: DType = uint64](crd: RuntimeTuple[crd_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type], stride: RuntimeTuple[stride_t, element_type=element_type]) -> SIMD[out_type, 1]`

Converts multi-dimensional coordinates to a linear index.

This function is the inverse of `idx2crd`, transforming a set of coordinates into a flat index based on the provided shape and stride information. This is essential for mapping multi-dimensional tensor elements to linear memory.

**Parameters:**

* crd_t (`IntTuple[$2]`): Type of the coordinates.
* shape_t (`IntTuple[$1]`): Type of the shape.
* stride_t (`IntTuple[$0]`): Type of the stride.
* out_type (`DType`): The output data type for the index (default: `uint64`).

**Args:**

* crd (`RuntimeTuple[crd_t, element_type=element_type]`): The coordinates to convert.
* shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array.
* stride (`RuntimeTuple[stride_t, element_type=element_type]`): The stride values for each dimension.

**Returns:**

A scalar value representing the linear index corresponding to the given coordinates.

---

## Create a knowledge base with a text embedding model

Text embeddings are rich numerical representations of text that power many modern natural language processing (NLP) applications. This tutorial shows you how to serve and interact with an embedding model using an OpenAI-compatible endpoint. Specifically, we'll use MAX to serve the [all-mpnet-base-v2](https://builds.modular.com/models/all-mpnet-base-v2/5B) model, a powerful transformer that excels at capturing semantic relationships in text.

In this tutorial, you'll learn how to:

- Set up a local embeddings server using the `all-mpnet-base-v2` model
- Build a smart knowledge base system using semantic similarity
- Implement document clustering and topic-based organization
- Create robust search functionality using embeddings

## Set up your environment

Create a Python project and install the MAX APIs and CLI tools.

## Serve the embedding model

Now start serving the [all-mpnet-base-v2](https://builds.modular.com/models/all-mpnet-base-v2/5B) model locally using MAX:

1. Start a local endpoint for `all-mpnet-base-v2`:

```sh
max serve --model-path=sentence-transformers/all-mpnet-base-v2
```

This creates a server running the `all-mpnet-base-v2` embedding model on `http://localhost:8000/v1/embeddings`, an [OpenAI-compatible endpoint](https://platform.openai.com/docs/api-reference/embeddings). The endpoint is ready when you see the URI printed in your terminal:

```output
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

2. Send a curl request to see what kind of response we get back. With the server running in your first terminal, run the following command in a second terminal:

```sh
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Run an embedding model with MAX Serve!",
    "model": "sentence-transformers/all-mpnet-base-v2"
  }'
```

The following is the expected output, shortened for brevity:

```output
{"data":[{"index":0,"embedding":[-0.06595132499933243,0.005941616836935282,0.021467769518494606,0.23037832975387573,
```

This returns a numerical representation of the input text that can be used for semantic comparisons.

Now that the endpoint is active and responsive, let's create an application that uses the embedding model and retrieves information.

## Build a knowledge base system

Now, let's build a smart knowledge base using the `all-mpnet-base-v2` model. You'll create a system that can match user queries to relevant documentation and automatically organize content into topics.

### 1. Install dependencies

Add the following libraries to your virtual environment with `pip`:

```sh
pip install numpy scikit-learn requests
```

Or with `uv`:

```sh
uv pip install numpy scikit-learn requests
```

Or, if you use `magic`, add the three libraries to your project:

```sh
magic add numpy scikit-learn requests
```

Then change into your working directory:

```sh
cd src/quickstart
```

These libraries help measure similarity of sentences and handle various computational tasks.
The `requests` library enables API communication with the embeddings endpoint.

### 2. Implement the knowledge base system

Now we will create a smart knowledge base system that can:

- Process and store documents with their semantic embeddings
- Search for relevant documents using natural language queries
- Automatically organize content into topics using clustering
- Suggest relevant topics based on user queries

The system uses embeddings from the `all-mpnet-base-v2` model to understand the meaning of text, enabling semantic search and intelligent document organization.

1. Create a new Python file called `kb_system.py` in your working directory and add the following:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import requests
from typing import List, Dict, Optional, Tuple
from functools import lru_cache
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class SmartKnowledgeBase:
    def __init__(self, endpoint: str = "http://localhost:8000/v1/embeddings"):
        self.endpoint = endpoint
        self.documents: List[str] = []
        self.doc_titles: List[str] = []
        self.embeddings: np.ndarray = None
        self.clusters: Dict[int, List[int]] = {}

    def _get_embedding(self, texts: List[str], max_retries: int = 3) -> np.ndarray:
        """Get embeddings with retry logic."""
        for attempt in range(max_retries):
            try:
                response = requests.post(
                    self.endpoint,
                    headers={"Content-Type": "application/json"},
                    json={"input": texts, "model": "sentence-transformers/all-mpnet-base-v2"},
                    timeout=5
                ).json()
                return np.array([item["embedding"] for item in response["data"]])
            except Exception as e:
                if attempt == max_retries - 1:
                    raise Exception(f"Failed to get embeddings after {max_retries} attempts: {e}")
                logger.warning(f"Attempt {attempt + 1} failed, retrying...")

    @lru_cache(maxsize=1000)
    def _get_embedding_cached(self, text: str) -> np.ndarray:
        """Cached version for single text embedding."""
        return self._get_embedding([text])[0]

    def add_document(self, title: str, content: str):
        """Add a single document with title."""
        self.doc_titles.append(title)
        self.documents.append(content)

        # Update embeddings
        if len(self.documents) == 1:
            self.embeddings = self._get_embedding([content])
        else:
            self.embeddings = np.vstack([self.embeddings, self._get_embedding([content])])

        # Recluster if we have enough documents
        if len(self.documents) >= 3:
            self._cluster_documents()

    def _cluster_documents(self, n_clusters: Optional[int] = None):
        """Cluster documents into topics."""
        if n_clusters is None:
            n_clusters = max(2, len(self.documents) // 5)
        n_clusters = min(n_clusters, len(self.documents))

        kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(self.embeddings)
        self.clusters = {}
        for i in range(n_clusters):
            self.clusters[i] = np.where(kmeans.labels_ == i)[0].tolist()

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, str, float]]:
        """Find documents most similar to the query."""
        query_embedding = self._get_embedding_cached(query)
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(self.doc_titles[i], self.documents[i], similarities[i])
                for i in top_indices]

    def get_topic_documents(self, topic_id: int) -> List[Tuple[str, str]]:
        """Get all documents in a topic cluster."""
        return [(self.doc_titles[i], self.documents[i])
                for i in self.clusters.get(topic_id, [])]

    def suggest_topics(self, query: str, top_k: int = 2) -> List[Tuple[int, float]]:
        """Rank topic clusters by their best match to the query."""
        query_embedding = self._get_embedding_cached(query)
        topic_similarities = []
        for topic_id, doc_indices in self.clusters.items():
            topic_embeddings = self.embeddings[doc_indices]
            similarity = cosine_similarity([query_embedding], topic_embeddings).max()
            topic_similarities.append((topic_id, similarity))
        return sorted(topic_similarities, key=lambda x: x[1], reverse=True)[:top_k]

# Example usage
if __name__ == "__main__":
    # Initialize knowledge base
    kb = SmartKnowledgeBase()

    # Add technical documentation
    kb.add_document(
        "Password Reset Guide",
        "To reset your password: 1. Click 'Forgot Password' 2. Enter your email "
        "3. Follow the reset link 4. Create a new password meeting security requirements"
    )

    kb.add_document(
        "Account Security",
        "Secure your account by enabling 2FA, using a strong password, and regularly "
        "monitoring account activity. Enable login notifications for suspicious activity."
    )

    kb.add_document(
        "Billing Overview",
        "Your billing cycle starts on the 1st of each month. View charges, update "
        "payment methods, and download invoices from the Billing Dashboard."
    )

    kb.add_document(
        "Payment Methods",
        "We accept credit cards, PayPal, and bank transfers. Update payment methods "
        "in Billing Settings. New payment methods are verified with a $1 hold."
    )

    kb.add_document(
        "Installation Guide",
        "Install by downloading the appropriate package for your OS. Run with admin "
        "privileges. Follow prompts to select installation directory and components."
    )

    kb.add_document(
        "System Requirements",
        "Minimum: 8GB RAM, 2GB storage, Windows 10/macOS 11+. Recommended: 16GB RAM, "
        "4GB storage, SSD, modern multi-core processor for optimal performance."
    )

    # Example 1: Search for password-related help
    print("\nSearching for password help:")
    results = kb.search("How do I change my password?")
    for title, content, score in results:
        print(f"\nTitle: {title}")
        print(f"Relevance: {score:.2f}")
        print(f"Content: {content[:100]}...")

    # Example 2: Get topic suggestions
    print("\nGetting topics for billing query:")
    query = "Where can I update my credit card?"
    topics = kb.suggest_topics(query)
    for topic_id, relevance in topics:
        print(f"\nTopic {topic_id} (Relevance: {relevance:.2f}):")
        for title, content in kb.get_topic_documents(topic_id):
            print(f"- {title}: {content[:50]}...")

    # Example 3: Get all documents in a topic
    print("\nAll documents in Topic 0:")
    for title, content in kb.get_topic_documents(0):
        print(f"\nTitle: {title}")
        print(f"Content: {content[:100]}...")
```

The `SmartKnowledgeBase` class implements an intelligent document retrieval and organization system using embeddings. You can add documents (`kb.add_document()`), search based on the user's question (`kb.search()`), and retrieve results.

2. Run the script. With the server running in your first terminal, run the following command in a second terminal within your working directory:

```sh
python kb_system.py
```

Or, if you use `magic`:

```sh
magic run python kb_system.py
```

On your first run, this might take longer. The following is the expected output, shortened for brevity:

```output
Title: Password Reset Guide
Relevance: 0.61
Content: To reset your password: 1. Click 'Forgot Password' 2. Enter your email 3. Follow the reset link 4. C...
```
## Conclusion In this tutorial, you learned how to: - Set up and test a local embeddings server using the `all-mpnet-base-v2` model - Build a smart knowledge base system that can process and retrieve documents based on semantic similarity - Implement document clustering and topic-based organization - Create a robust search functionality using embeddings --- ## create_matmul_configs_ampere `create_matmul_configs_ampere[key: String, a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool]() -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## create_task `create_task(owned handle: Coroutine[type, origins], out task: Task[type, origins])` Run the coroutine as a task on the AsyncRT Runtime. This function creates a task from a coroutine and schedules it for execution on the async runtime. The task will execute asynchronously without blocking the current execution context. **Args:** * ​handle (`Coroutine[type, origins]`): The coroutine to execute as a task. Ownership is transferred. **Returns:** The `task` output parameter is initialized with the created task. --- ## create_tile_configs `create_tile_configs[key: String, a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool]() -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## create_tma_tile `create_tma_tile[*tile_sizes: Int, *, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](ctx: DeviceContext, tensor: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> TMATensorTile[dtype, row_major[::Origin[::Bool(_to_int_tuple[*::Int]())]` Creates a `TMATensorTile` with specified tile dimensions and swizzle mode. This function creates a hardware-accelerated Tensor Memory Access (TMA) descriptor for efficient asynchronous data transfers between global memory and shared memory. It configures the tile dimensions and memory access patterns based on the provided parameters. **Constraints:** * The last dimension's size in bytes must not exceed the swizzle mode's byte limit (32B for SWIZZLE\_32B, 64B for SWIZZLE\_64B, 128B for SWIZZLE\_128B). * Only supports 2D tensors in this overload. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of the tile to be transferred. For 2D tensors, this should be \[height, width]. The dimensions determine the shape of data transferred in each TMA operation. * ​swizzle\_mode (`TensorMapSwizzle`): The swizzling mode to use for memory access optimization. Swizzling can improve memory access patterns for specific hardware configurations. **Args:** * ​ctx (`DeviceContext`): The CUDA device context used to create the TMA descriptor. * ​tensor (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor from which data will be transferred. This defines the global memory layout and data type. **Returns:** A `TMATensorTile` configured with the specified tile dimensions and swizzle mode, ready for use in asynchronous data transfer operations. 
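For this first overload, usage reduces to picking tile dimensions and passing the device context and source tensor. The sketch below assumes `ctx` and a 2D `tensor` already exist (created from a device buffer elsewhere), and it assumes the `layout.tma_async` import path; verify both against your version of the API.

```mojo
from layout.tma_async import create_tma_tile

# Sketch: `ctx` is a DeviceContext and `tensor` is a 2D LayoutTensor in
# global memory, both created elsewhere (setup not shown).
var tma_tile = create_tma_tile[64, 64](ctx, tensor)

# `tma_tile` can now drive asynchronous global-to-shared transfers of
# 64x64 tiles of `tensor`.
```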
`create_tma_tile[type: DType, rank: Int, tile_shape: IndexList[rank], /, is_k_major: Bool = True, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), *, __tile_layout: Layout = row_major(tile_shape.__getitem__[::Indexer](0), tile_shape.__getitem__[::Indexer](1)), __desc_layout: Layout = _tma_desc_tile_layout[::DType,::Int,::IndexList[$1, ::DType()](ctx: DeviceContext, tensor: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> TMATensorTile[type, __tile_layout, __desc_layout]`

Creates a `TMATensorTile` with advanced configuration options for 2D or 3D tensors.

This overload provides more control over the TMA descriptor creation, allowing specification of data type, rank, and layout orientation. It supports both 2D and 3D tensors and provides fine-grained control over the memory access patterns.

**Constraints:**

* Only supports 2D and 3D tensors (rank must be 2 or 3).
* For non-`SWIZZLE_NONE` modes, the K dimension size in bytes must be a multiple of the swizzle mode's byte size.
* For MN-major layout, only `SWIZZLE_128B` is supported.
* For 3D tensors, only K-major layout is supported.

**Parameters:**

* type (`DType`): The data type of the tensor elements.
* rank (`Int`): The dimensionality of the tensor (must be 2 or 3).
* tile_shape (`IndexList[rank]`): The shape of the tile to be transferred.
* is_k_major (`Bool`): Whether the tensor layout is K-major (`True`) or MN-major (`False`). Defaults to `True`. K-major is typically used for weight matrices, while MN-major is used for activation matrices in matrix multiplication operations.
* swizzle_mode (`TensorMapSwizzle`): The swizzling mode to use for memory access optimization. Defaults to `TensorMapSwizzle.SWIZZLE_NONE`.
* \_\_tile\_layout (`Layout`): Internal parameter for the tile layout in shared memory. Defaults to `Layout.row_major(tile_shape[0], tile_shape[1])`.
* \_\_desc\_layout (`Layout`): Internal parameter for the descriptor layout, which may differ from the tile layout to accommodate hardware requirements.

**Args:**

* ctx (`DeviceContext`): The CUDA device context used to create the TMA descriptor.
* tensor (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor from which data will be transferred. This defines the global memory layout and must match the specified data type.

**Returns:**

A `TMATensorTile` configured with the specified parameters, ready for use in asynchronous data transfer operations.

---

## cumsum

`cumsum[rank: Int, type: DType, exclusive: Bool, reverse: Bool](output: NDBuffer[type, rank, origin], input: NDBuffer[type, rank, origin], axis: Int)`

Implements the CumSum operator from the ONNX spec: Computes cumulative sum of the input elements along the given axis. Cumulative sum can be inclusive or exclusive of the top element, and normal or reverse (direction along a given axis).

**Parameters:**

* rank (`Int`): Rank of the input and output tensors.
* type (`DType`): Type of the input and output tensors.
* exclusive (`Bool`): If set to True, return exclusive sum (top element not included).
* ​reverse (`Bool`): If set to True, perform cumsum operation in reverse direction. **Args:** * ​output (`NDBuffer[type, rank, origin]`): The output tensor. * ​input (`NDBuffer[type, rank, origin]`): The input tensor. * ​axis (`Int`): The axis on which to perform the cumsum operation. --- ## cumsum ## Functions * [​`cumsum`](./cumsum): Implements the CumSum operator from the ONNX spec: Computes cumulative sum of the input elements along the given axis. Cumulative sum can be inclusive or exclusive of the top element, and normal or reverse (direction along a given axis). --- ## cumsum `cumsum(dst: NDBuffer[type, 1, origin], src: NDBuffer[type, 1, origin, shape, strides])` Computes the cumulative sum of all elements in a buffer. dst\[i] = src\[i] + src\[i-1] + ... + src\[0]. **Args:** * ​dst (`NDBuffer[type, 1, origin]`): The buffer that stores the result of cumulative sum operation. * ​src (`NDBuffer[type, 1, origin, shape, strides]`): The buffer of elements for which the cumulative sum is computed. --- ## cwd `cwd() -> Path` Gets the current directory. **Returns:** The current directory. --- ## Death of a value As soon as a value/object is no longer used, Mojo destroys it. Mojo does *not* wait until the end of a code block—or even until the end of an expression—to destroy an unused value. It destroys values using an “as soon as possible” (ASAP) destruction policy that runs after every sub-expression. Even within an expression like `a+b+c+d`, Mojo destroys the intermediate values as soon as they're no longer needed. Mojo uses static compiler analysis to find the point where a value is last used. Then, Mojo immediately ends the value's lifetime and calls the `__del__()` destructor to perform any necessary cleanup for the type. For example, notice when the `__del__()` destructor is called for each instance of `MyPet`: ```mojo @value struct MyPet: var name: String var age: Int fn __del__(owned self): print("Destruct", self.name) fn pets(): var a = MyPet("Loki", 4) var b = MyPet("Sylvie", 2) print(a.name) # a.__del__() runs here for "Loki" a = MyPet("Charlie", 8) # a.__del__() runs immediately because "Charlie" is never used print(b.name) # b.__del__() runs here pets() ``` ```output Loki Destruct Loki Destruct Charlie Sylvie Destruct Sylvie ``` Notice that each initialization of a value is matched with a call to the destructor, and `a` is actually destroyed multiple times—once for each time it receives a new value. Also notice that this `__del__()` implementation doesn't actually do anything. Most structs don't require a custom destructor, and Mojo automatically adds a no-op destructor if you don't define one. ### Default destruction behavior You may be wondering how Mojo can destroy a type without a custom destructor, or why a no-op destructor is useful. If a type is simply a collection of fields, like the `MyPet` example, Mojo only needs to destroy the fields: `MyPet` doesn't dynamically allocate memory or use any long-lived resources (like file handles). There's no special action to take when a `MyPet` value is destroyed. Looking at the individual fields, `MyPet` includes an `Int` and a `String`. The `Int` is what Mojo calls a *trivial type*. It's a statically-sized bundle of bits. Mojo knows exactly how big it is, so those bits can be reused to store something else. The `String` value is a little more complicated. Mojo strings are mutable. The `String` object has an internal buffer—a [`List`](/mojo/stdlib/collections/list/List) field, which holds the characters that make up the string. 
A `List` stores its contents in dynamically allocated memory on the heap, so the string can grow or shrink. The string itself doesn't have any special destructor logic, but when Mojo destroys a string, it calls the destructor for the `List` field, which de-allocates the memory. Since `String` and `Int` don't require any custom destructor logic, they both have no-op destructors: literally, `__del__()` methods that don't do anything. This may seem pointless, but it means that Mojo can call the destructor on any value when its lifetime ends. This makes it easier to write generic containers and algorithms. ### Benefits of ASAP destruction Similar to other languages, Mojo follows the principle that objects/values acquire resources in a constructor (`__init__()`) and release resources in a destructor (`__del__()`). However, Mojo's ASAP destruction has some advantages over scope-based destruction (such as the C++ [RAII pattern](https://en.cppreference.com/w/cpp/language/raii), which waits until the end of the code scope to destroy values): * Destroying values immediately at last-use composes nicely with the "move" optimization, which transforms a "copy+del" pair into a "move" operation. * Destroying values at end-of-scope in C++ is problematic for some common patterns like tail recursion, because the destructor call happens after the tail call. This can be a significant performance and memory problem for certain functional programming patterns, which is not a problem in Mojo, because the destructor call always happens before the tail call. Additionally, Mojo's ASAP destruction works great within Python-style `def` functions. That's because Python doesn't really provide scopes beyond a function scope, so the Python garbage collector cleans up resources more often than a scope-based destruction policy would. However, Mojo does not use a garbage collector, so the ASAP destruction policy provides destruction guarantees that are even more fine-grained than in Python. The Mojo destruction policy is more similar to how Rust and Swift work, because they both have strong value ownership tracking and provide memory safety. One difference is that Rust and Swift require the use of a [dynamic "drop flag"](https://doc.rust-lang.org/nomicon/drop-flags.html)—they maintain hidden shadow variables to keep track of the state of your values to provide safety. These are often optimized away, but the Mojo approach eliminates this overhead entirely, making the generated code faster and avoiding ambiguity. ## Destructor Mojo calls a value's destructor (`__del__()` method) when the value's lifetime ends (typically the point at which the value is last used). As we mentioned earlier, Mojo provides a default, no-op destructor for all types, so in most cases you don't need to define the `__del__()` method. You should define the `__del__()` method to perform any kind of cleanup the type requires. Usually, that includes freeing memory for any fields where you dynamically allocated memory (for example, via `UnsafePointer`) and closing any long-lived resources such as file handles. However, any struct that is just a simple collection of other types does not need to implement the destructor. 
For example, consider this simple struct: ```mojo struct MyPet: var name: String var age: Int fn __init__(out self, name: String, age: Int): self.name = name self.age = age ``` There's no need to define the `__del__()` destructor for this, because it's a simple collection of other types (`String` and `Int`), and it doesn't dynamically allocate memory. Whereas, the following struct must define the `__del__()` method to free the memory allocated by its [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer): ```mojo from memory import UnsafePointer struct HeapArray: var data: UnsafePointer[Int] var size: Int fn __init__(out self, size: Int, val: Int): self.size = size self.data = UnsafePointer[Int].alloc(self.size) for i in range(self.size): (self.data + i).init_pointee_copy(val) fn __del__(owned self): for i in range(self.size): (self.data + i).destroy_pointee() self.data.free() ``` Note that a pointer doesn't *own* any values in the memory it points to, so when a pointer is destroyed, Mojo doesn't call the destructors on those values. So in the `HeapArray` example above, calling `free()` on the pointer releases the memory, but doesn't call the destructors on the stored values. To invoke the destructors, use the `destroy_pointee()` method provided by the `UnsafePointer` type. :::note You can't just call the destructor explicitly. Because `__del__()` takes `self` as an `owned` value, and owned arguments are copied by default, `foo.__del__()` actually creates and destroys a *copy* of `foo`. When Mojo destroys a value, however, it passes in the original value as `self`, not a copy. ::: It's important to notice that the `__del__()` method is an "extra" cleanup event, and your implementation does not override any default destruction behaviors. For example, Mojo still destroys all the fields in `MyPet` even if you implement `__del__()` to do nothing: ```mojo struct MyPet: var name: String var age: Int fn __init__(out self, name: String, age: Int): self.name = name self.age = age fn __del__(owned self): # Mojo destroys all the fields when they're last used pass ``` However, the `self` value inside the `__del__()` destructor is still whole (so all fields are still usable) until the destructor returns, as we'll discuss more in the following section. :::note Destructors cannot raise errors Currently a Mojo destructor isn't allowed to raise an error. This means that the destructor must be defined as an `fn` function without the `raises` keyword. Mojo won't allow you to define a destructor using `fn raises` or `def`. ::: ## Field lifetimes In addition to tracking the lifetime of all objects in a program, Mojo also tracks each field of a structure independently. That is, Mojo keeps track of whether a "whole object" is fully or partially initialized/destroyed, and it destroys each field independently with its ASAP destruction policy. For example, consider this code that changes the value of a field: ```mojo @value struct MyPet: var name: String var age: Int fn use_two_strings(): var pet = MyPet("Po", 8) print(pet.name) # pet.name.__del__() runs here, because this instance is # no longer used; it's replaced below pet.name = String("Lola") # Overwrite pet.name print(pet.name) # pet.__del__() runs here ``` The `pet.name` field is destroyed after the first `print()`, because Mojo knows that it will be overwritten below. 
You can also see this behavior when using the transfer sigil: ```mojo fn consume(owned arg: String): pass fn use(arg: MyPet): print(arg.name) fn consume_and_use(): var pet = MyPet("Selma", 5) consume(pet.name^) # pet.name.__moveinit__() runs here, which destroys pet.name # Now pet is only partially initialized # use(pet) # This fails because pet.name is uninitialized pet.name = String("Jasper") # All together now use(pet) # This is ok # pet.__del__() runs here (and only if the object is whole) ``` Notice that the code transfers ownership of the `name` field to `consume()`. For a period of time after that, the `name` field is uninitialized. Then `name` is reinitialized before it is passed to the `use()` function. If you try calling `use()` before `name` is re-initialized, Mojo rejects the code with an uninitialized field error. Also, if you don't re-initialize the name by the end of the `pet` lifetime, the compiler complains because it's unable to destroy a partially initialized object. Mojo's policy here is powerful and intentionally straight-forward: fields can be temporarily transferred, but the "whole object" must be constructed with the aggregate type's initializer and destroyed with the aggregate destructor. This means it's impossible to create an object by initializing only its fields, and it's likewise impossible to destroy an object by destroying only its fields. ### Field lifetimes during destruct and move The consuming-move constructor and destructor face an interesting situation with field lifetimes, because, unlike other lifecycle methods, they both take an instance of their own type as an `owned` argument, which is about to be destroyed. You don't really need to worry about this detail when implementing these methods, but it might help you better understand field lifetimes. Just to recap, the move constructor and destructor method signatures look like this: ```mojo struct TwoStrings: fn __moveinit__(out self, owned existing: Self): # Initializes a new `self` by consuming the contents of `existing` fn __del__(owned self): # Destroys all resources in `self` ``` :::note There are two kinds of "self" here: capitalized `Self` is an alias for the current type name (used as a type specifier for the `existing` argument), whereas lowercase `self` is the argument name for the implicitly-passed reference to the current instance (also called "this" in other languages, and also implicitly a `Self` type). ::: Both of these methods face an interesting but obscure problem: they both must dismantle the `existing`/`self` value that's `owned`. That is, `__moveinit__()` implicitly destroys sub-elements of `existing` in order to transfer ownership to a new instance (read more about the [move constructor](/mojo/manual/lifecycle/life#move-constructor)), while `__del__()` implements the deletion logic for its `self`. As such, they both need to own and transform elements of the `owned` value, and they definitely don't want the original `owned` value's destructor to also run—that could result in a double-free error, and in the case of the `__del__()` method, it would become an infinite loop. To solve this problem, Mojo handles these two methods specially by assuming that their whole values are destroyed upon reaching any return from the method. This means that the whole object may be used as usual, up until the field values are transferred or the method returns. 
For example, the following code works as you would expect (within the destructor, we can still pass ownership of a field value to another function, and there's no infinite loop to destroy `self`): ```mojo fn consume(owned str: String): print('Consumed', str) struct TwoStrings: var str1: String var str2: String fn __init__(out self, one: String): self.str1 = one self.str2 = String("bar") fn __moveinit__(out self, owned existing: Self): self.str1 = existing.str1 self.str2 = existing.str2 fn __del__(owned self): self.dump() # Self is still whole here # Mojo calls self.str2.__del__() since str2 isn't used anymore consume(self.str1^) # self.str1 has been transferred so it is also destroyed now; # `self.__del__()` is not called (avoiding an infinite loop). fn dump(mut self): print('str1:', self.str1) print('str2:', self.str2) fn use_two_strings(): var two_strings = TwoStrings("foo") ``` ## Explicit lifetime extension So far, we've described how Mojo destroys a value at the point it's last used, and this works great in almost all situations. Mojo [origins](/mojo/manual/values/lifetimes) help the compiler track values that are allocated in one place and used in another. However, there are very rare situations in which you may need to explicitly extend the lifetime of a value. This can happen: - When you're writing tests that generate values that aren't actually used, to avoid the compiler issuing warnings and/or optimizing away values. - You're writing unsafe code (for example, code that explicitly manipulates a value's `origin`). In these cases, you can force Mojo to keep a value alive up to a certain point by assigning the value to the `_` discard pattern at the point where it's okay to destroy it. For example: ```mojo # Keep foo alive until this point _ = foo ``` If you don't _know_ you need to do this, you probably don't. --- ## debug This module includes the debug hook functions. ## Functions * [​`breakpointhook`](/mojo/stdlib/sys/debug/breakpointhook): Cause an execution trap with the intention of requesting the attention of a debugger. --- ## debug_assert `debug_assert[: origin.set, //, cond: fn() capturing -> Bool, write_mode: Int = 0, assert_mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("none"), cpu_only: Bool = False, *Ts: Writable = *?](*messages: *Ts, *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the condition is true at run time. If the condition is false, the assertion displays the given message and causes the program to exit. You can pass in multiple arguments to generate a formatted message. No string allocation occurs unless the assertion is triggered. ```mojo x = 0 debug_assert(x > 0, "expected x to be more than 0 but got: ", x) ``` Normal assertions are off by default—they only run when the program is compiled with all assertions enabled. You can set the `assert_mode` to `safe` to create an assertion that's on by default: ```mojo debug_assert[assert_mode="safe"]( x > 0, "expected x to be more than 0 but got: ", x ) ``` Use the `ASSERT` variable to turn assertions on or off when building or running a Mojo program: ```sh mojo -D ASSERT=all main.mojo ``` The `ASSERT` variable takes the following values: * all: Turn on all assertions. * safe: Turn on "safe" assertions only. This is the default. * none: Turn off all assertions, for performance at the cost of safety. * warn: Turn on all assertions, but print any errors instead of exiting. 
To ensure that you have no run-time penalty from your assertions even when they're disabled, make sure there are no side effects in your message and condition expressions. For example: ```mojo person = "name: john, age: 50" name = "john" debug_assert(String("name: ") + name == person, "unexpected name") ``` This will have a run-time penalty due to allocating a `String` in the condition expression, even when assertions are disabled. To avoid this, put the condition inside a closure so it runs only when the assertion is turned on: ```mojo fn check_name() capturing -> Bool: return String("name: ") + name == person debug_assert[check_name]("unexpected name") ``` If you need to allocate, and so don't want the assert to ever run on GPU, you can set it to CPU only: ```mojo debug_assert[check_name, cpu_only=True]("unexpected name") ``` For compile-time assertions, see [`constrained()`](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​cond (`fn() capturing -> Bool`): The function to invoke to check if the assertion holds. * ​write\_mode (`Int`): Determines whether to keep values in register or not. * ​assert\_mode (`StringSlice[StaticConstantOrigin]`): Determines when the assert is turned on. * default ("none"): Turned on when compiled with `-D ASSERT=all`. * "safe": Turned on by default. * ​cpu\_only (`Bool`): If true, only run the assert on CPU. * ​\*Ts (`Writable`): The element types for the message arguments. **Args:** * ​\*messages (`*Ts`): A set of [`Writable`](/mojo/stdlib/utils/write/Writable/) arguments to convert to a `String` message. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). `debug_assert[write_mode: Int = 0, assert_mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("none"), cpu_only: Bool = False, *Ts: Writable = *?](cond: Bool, *messages: *Ts, *, location: Optional[_SourceLocation] = Optional(None))` Asserts that the condition is true at run time. If the condition is false, the assertion displays the given message and causes the program to exit. You can pass in multiple arguments to generate a formatted message. No string allocation occurs unless the assertion is triggered. ```mojo x = 0 debug_assert(x > 0, "expected x to be more than 0 but got: ", x) ``` Normal assertions are off by default—they only run when the program is compiled with all assertions enabled. You can set the `assert_mode` to `safe` to create an assertion that's on by default: ```mojo debug_assert[assert_mode="safe"]( x > 0, "expected x to be more than 0 but got: ", x ) ``` Use the `ASSERT` variable to turn assertions on or off when building or running a Mojo program: ```sh mojo -D ASSERT=all main.mojo ``` The `ASSERT` variable takes the following values: * all: Turn on all assertions. * safe: Turn on "safe" assertions only. This is the default. * none: Turn off all assertions, for performance at the cost of safety. * warn: Turn on all assertions, but print any errors instead of exiting. To ensure that you have no run-time penalty from your assertions even when they're disabled, make sure there are no side effects in your message and condition expressions. For example: ```mojo person = "name: john, age: 50" name = "john" debug_assert(String("name: ") + name == person, "unexpected name") ``` This will have a run-time penalty due to allocating a `String` in the condition expression, even when assertions are disabled. 
To avoid this, put the condition inside a closure so it runs only when the assertion is turned on: ```mojo fn check_name() capturing -> Bool: return String("name: ") + name == person debug_assert[check_name]("unexpected name") ``` If you need to allocate, and so don't want the assert to ever run on GPU, you can set it to CPU only: ```mojo debug_assert[check_name, cpu_only=True]("unexpected name") ``` For compile-time assertions, see [`constrained()`](/mojo/stdlib/builtin/constrained/constrained). **Parameters:** * ​write\_mode (`Int`): Determines whether to keep values in register or not. * ​assert\_mode (`StringSlice[StaticConstantOrigin]`): Determines when the assert is turned on. * default ("none"): Turned on when compiled with `-D ASSERT=all`. * "safe": Turned on by default. * ​cpu\_only (`Bool`): If true, only run the assert on CPU. * ​\*Ts (`Writable`): The element types for the message arguments. **Args:** * ​cond (`Bool`): The bool value to assert. * ​\*messages (`*Ts`): A set of [`Writable`](/mojo/stdlib/utils/write/Writable/) arguments to convert to a `String` message. * ​location (`Optional[_SourceLocation]`): The location of the error (defaults to `__call_location`). --- ## debug_assert Implements run-time assertions. These are Mojo built-ins, so you don't need to import them. ## Aliases ### `ASSERT_MODE` `alias ASSERT_MODE = env_get_string[::StringSlice[::Bool()` ### `WRITE_MODE` `alias WRITE_MODE = Int` ### `WRITE_MODE_MEM` `alias WRITE_MODE_MEM = 1` ### `WRITE_MODE_REG` `alias WRITE_MODE_REG = 0` ## Functions * [​`debug_assert`](/mojo/stdlib/builtin/debug_assert/debug_assert): Asserts that the condition is true at run time. --- ## Debugging The Mojo extension for Visual Studio Code enables you to use VS Code's built-in debugger with Mojo programs. (The Mojo extension also supports debugging C, C++, and Objective-C.) For complete coverage of VS Code's debugging features, see [Debugging in Visual Studio Code](https://code.visualstudio.com/docs/editor/debugging). This page describes the features available through the Mojo extension, as well as current limitations of the Mojo debugger. The MAX SDK includes the [LLDB debugger](https://lldb.llvm.org/) and a Mojo LLDB plugin. Together these provide the low-level debugging interface for the Mojo extension. You can also use the `mojo debug` command to start a command-line debugging session using LLDB or to launch a Mojo debugging session in VS Code. The MAX SDK also includes support for debugging Mojo programs running on GPU. This requires some extra software and configuration. Currently GPU debugging only works with NVIDIA GPUs. For details, see [GPU debugging](/mojo/tools/gpu-debugging). ## Start debugging There are several ways to start a debug session in VS Code. To start debugging, you'll need to have a Mojo project to debug. There are a number of examples ranging from simple to complex in [our GitHub repo](https://github.com/modular/modular/tree/main/examples/mojo). :::note **VS Code veteran?** If you're already familiar with debugging in VS Code, the material in this section will mostly be review. You might want to skip ahead to [Launch configurations](#launch-configurations) or see [Using the debugger](#using-the-debugger) for notes on the features supported in the Mojo debugger. ::: ### Quick run or debug If your active editor tab contains a Mojo file with an `fn main()` entry point, one of the quickest ways to run or debug it is using the **Run or Debug** button in the Editor toolbar. 
![](images/quick-run-or-debug-button.png)

To start debugging the current file:

* Open the **Run or Debug** dropdown menu and choose **Debug Mojo File** or **Debug Mojo File in Dedicated Terminal**.

![](images/quick-run-or-debug-menu.png)

The two debug configurations differ in how they handle input and output:

* **Debug Mojo File** launches the Mojo program detached from any terminal. Standard output and standard error output for the program are displayed in the **Debug Console**. You can't write to the program's standard input, but you can see the program's output and interact with the debugger in a single location.
* **Debug Mojo File in Dedicated Terminal** creates a new instance of VS Code's integrated terminal and attaches the program's input and output to the terminal. This lets you interact with the program's standard input, standard output, and standard error output in the terminal, while the **Debug Console** is used only for interactions with the debugger.

The **Run or Debug** button uses predefined launch configurations. There's currently no way to modify the `args`, `env`, `cwd`, or other settings for programs launched with the **Run or Debug** configurations. If you need to customize any of these things, see [Edit launch configurations](#edit-launch-configurations).

After you choose one of the debug configurations, the button updates to show the debug symbol. Click the button to re-run the previous configuration.

![](images/quick-run-or-debug-button-debug.png)

### Run and Debug view

The **Run and Debug** view includes a button to launch debug sessions and a menu to select debug configurations. It also has areas to display current variables, watch expressions, the current call stack, and breakpoints.

![](images/run-and-debug-view.png)

Figure 1. Run and Debug view

To open the **Run and Debug** view, click the **Run and Debug** icon in the **Activity Bar** (on the left side of the VS Code window) or press Control+Shift+D (Command+Shift+D on macOS).

![](images/run-and-debug-icon.png)

If you haven't created any launch configurations in the current project, VS Code shows the **Run start view**.

![](images/run-start-view.png)

Figure 2. Run start view

If you've already launched a debug session or created a `launch.json` file to define launch configurations, you'll see the **Launch configurations** menu, which lets you choose configurations and start debug sessions:

![](images/launch-configuration-menu.png)

Figure 3. Launch configurations menu

### Other ways to start a debug session

There are a number of other ways to start a debug session.

#### Launch from the Command Palette

If you have a Mojo file open in your active editor, you can also start a debug session from the **Command Palette**:

1. Click **View** > **Command Palette** or press Control+Shift+P (Command+Shift+P on macOS).
2. Enter "Mojo" at the prompt to bring up the Mojo commands.

You should see the same debug configurations described in [Quick run or debug](#quick-run-or-debug).

#### Launch from the File Explorer

To launch a debug session from the **File Explorer** view:

1. Right-click on a Mojo file.
2. Select a Mojo debug configuration.

You should see the same debug configurations described in [Quick run or debug](#quick-run-or-debug).

#### Debug with F5

Press F5 to start a debug session using the current debug configuration.
If you don't have any existing debug configurations available to select, and your active editor contains a Mojo file with an `fn main()` entry point, pressing F5 will launch and debug the current file using the **Debug Mojo File** action described in [Quick run or debug](#quick-run-or-debug).

## Starting the debugger from the command line

Use the `mojo debug` command to start a debug session from the command line. You can choose from two debugging interfaces:

* With the `--vscode` flag, `mojo debug` starts a debug session on VS Code if it's running and the Mojo extension is enabled.
* Without the `--vscode` flag, `mojo debug` starts a command-line [LLDB debugger](https://lldb.llvm.org/) session.

You can choose to build and debug a Mojo file, run and debug a compiled binary, or attach the debugger to a running process.

:::note Environment variables
When you debug a program from the command line using `--vscode`, the program runs with the environment variables set in the terminal. When launching from inside VS Code via the GUI, the environment is defined by the VS Code [launch configuration](#launch-configurations).
:::

For a full list of command-line options, see the [`mojo debug` reference page](/mojo/cli/debug).

### Start a debug session from the command line

With VS Code open, run the following command (either from VS Code's integrated terminal or an external shell):

```bash
mojo debug --vscode myproject.mojo
```

Or to debug a compiled binary:

```bash
mojo debug --vscode myproject
```

For best results, build with the `-O0 -g` command-line options when you build a binary that you intend to debug—this produces a binary with full debug info. (When you call `mojo debug` on a Mojo source file, it includes debug information by default.) See the [`mojo build` reference page](/mojo/cli/build) for details on compilation options.

### Attach the debugger to a running process from the command line

You can also attach the debugger to a running process by specifying either the process ID or process name on the command line:

```bash
mojo debug --vscode --pid <PROCESS_ID>
```

Or:

```bash
mojo debug --vscode --process-name <PROCESS_NAME>
```

## Launch configurations

VS Code *launch configurations* let you define setup information for debugging your applications.

The Mojo debugger provides the following launch configuration templates:

* Debug current Mojo file. Launches and debugs the Mojo file in the active editor tab. Effectively the same as the **Debug Mojo File** action described in [Quick run or debug](#quick-run-or-debug), but with more configuration options.
* Debug Mojo file. Like the previous entry, except that it identifies a specific file to launch and debug, no matter what file is displayed in the active editor.
* Debug binary. This configuration operates on a prebuilt binary, which could be written in any mixture of languages supported by LLDB (Mojo, C, C++, etc.). You need to set the `program` field to the path of your binary.
* Attach to process. Launches a debug session attached to a running process. On launch, you choose the process you want to debug from a list of running processes.

You can edit any of these templates to customize them.

All VS Code launch configurations must contain the following attributes:

* `name`. The name of the launch configuration, which shows up in the UI (for example, "Run current Mojo file").
* `request`. Can be either `launch` (to run a program from VS Code) or `attach` (to attach to and debug a running file).
* `type`. Use `mojo-lldb` for the Mojo debugger. Use `mojo-cuda-gdb` to debug on GPU.
## Launch configurations VS Code *launch configurations* let you define setup information for debugging your applications. The Mojo debugger provides the following launch configuration templates: * Debug current Mojo file. Launches and debugs the Mojo file in the active editor tab. Effectively the same as the **Debug Mojo File** action described in [Quick run or debug](#quick-run-or-debug), but with more configuration options. * Debug Mojo file. Like the previous entry, except that it identifies a specific file to launch and debug, no matter what file is displayed in the active editor. * Debug binary. This configuration operates on a prebuilt binary, which could be written in any mixture of languages supported by LLDB (Mojo, C, C++, etc.). You need to set the `program` field to the path of your binary. * Attach to process. Launches a debug session attached to a running process. On launch, you choose the process you want to debug from a list of running processes. You can edit any of these templates to customize them. All VS Code launch configurations must contain the following attributes: * `name`. The name of the launch configuration, which shows up in the UI (for example, "Run current Mojo file"). * `request`. Can be either `launch` (to run a program from VS Code) or `attach` (to attach to and debug a running process). * `type`. Use `mojo-lldb` for the Mojo debugger, or `mojo-cuda-gdb` to debug on GPU. In addition, Mojo launch configurations can contain the following attributes: * `args`. Any command-line arguments to be passed to the program. * `cwd`. The current working directory to run the program in. * `description`. A longer description of the configuration, not shown in the UI. * `env`. Environment variables to be set before running the program. * `mojoFile`. Path to a Mojo file to launch and debug. * `pid`. Process ID of the running process to attach to. * `program`. Path to a compiled binary to launch and debug, or the program to attach to. * `runInTerminal`. True to run the program with a dedicated terminal, which allows the program to receive standard input from the terminal. False to run the program with its output directed to the **Debug Console**. Mojo GPU launch configurations can contain the following attributes: * `breakOnLaunch`. Set to true to automatically break when a GPU kernel launches. * `initCommands`. An array of commands to issue to the debugger on startup. To use the classic CUDA-GDB debugger backend, add the following lines to your configuration: ```json "initCommands": [ "set environment CUDBG_USE_LEGACY_DEBUGGER=1" ], ``` * `legacyDebugger`. Set to true to use the classic debugger backend. If the configuration is a `launch` request, the configuration must include either the `mojoFile` or `program` attribute. For `attach` requests, the configuration must include either the `pid` or `program` attribute. VS Code performs variable substitution on the launch configurations. You can use `${workspaceFolder}` to substitute the path to the current workspace, and `${file}` to represent the file in the active editor tab. For a complete list of variables, see the VS Code [Variables reference](https://code.visualstudio.com/docs/editor/variables-reference). For more information, see the VS Code documentation for [Launch configurations](https://code.visualstudio.com/docs/editor/debugging#_launch-configurations). :::note Compilation options Mojo launch configurations don't allow you to specify compilation options. If you need to specify compilation options, you can build the binary using [`mojo build`](/mojo/cli/build), then use a launch configuration with the `program` option to launch the compiled binary. Or if you [start the debugger from the command line](#starting-the-debugger-from-the-command-line), you can pass compilation options to the `mojo debug` command. ::: ### Edit launch configurations To edit launch configurations: 1. If the **Run and Debug** view isn't already open, click the **Run and Debug** icon in the **Activity Bar** (on the left side of the VS Code window) or press Control+Shift+D (Command+Shift+D on macOS). ![](images/run-and-debug-icon.png) 2. Create or open the `launch.json` file: 1. If you see the **Run start view**, click **create a launch.json file**. 2. If you already have launch configurations set up, click the gear icon next to the **Launch configurations** menu. ![](images/launch-configuration-menu.png) 3. Select **Mojo** from the list of debuggers. VS Code opens the new `launch.json` file in an editor tab, with templates for some common debug actions. Click **Add configuration** to add a new configuration template.
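For illustration, here's a sketch of a `launch.json` with a single Mojo configuration that launches the file in the active editor tab. The `args` and `env` values are hypothetical placeholders; adjust the attributes using the lists above:

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "mojo-lldb",
      "request": "launch",
      "name": "Debug current Mojo file with arguments",
      "mojoFile": "${file}",
      "args": ["--iterations", "10"],
      "env": ["LOG_LEVEL=debug"],
      "cwd": "${workspaceFolder}",
      "runInTerminal": false
    }
  ]
}
```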
## Using the debugger When a debug session is running, use the debug toolbar to pause, continue, and step through the program. ![](images/debug-toolbar.png) The buttons on the toolbar are: * **Continue/Pause**: If the program is stopped, resume the normal execution of the program up to the next breakpoint, signal, or crash. Otherwise, pause all the threads of the program at once. * **Step Over**: Execute the next line of code without stopping at function calls. * **Step Into**: Execute the next line of code and stop at the first function call. If the program is stopped just before a function call, steps into the function so you can step through it line by line. * **Step Out**: Finish the execution of the current function and stop right after returning to the parent function. * **Restart**: If this is a `launch` session, terminate the current program and restart the debug session. Otherwise, detach from the target process and reattach to it. * **Stop**: If this is a `launch` session, terminate the current program. Otherwise, detach from the target process without killing it. The debugger currently has the following limitations: * No support for breaking automatically on Mojo errors. * When stepping out of a function, the returned value is not displayed. * LLDB doesn't support stopping or resuming individual threads. ### Breakpoints The Mojo debugger supports setting [standard breakpoints](https://code.visualstudio.com/docs/editor/debugging#_breakpoints), [logpoints](https://code.visualstudio.com/docs/editor/debugging#_logpoints), [function breakpoints](https://code.visualstudio.com/docs/editor/debugging#_function-breakpoints), [data breakpoints](https://code.visualstudio.com/docs/editor/debugging#_data-breakpoints), and [triggered breakpoints](https://code.visualstudio.com/docs/editor/debugging#_triggered-breakpoints), as described in the VS Code documentation. The Mojo debugger also supports *error breakpoints* (also known as "break on raise"), which break whenever a `raise` statement is executed. When debugging Mojo code, the debugger doesn't support conditional breakpoints based on an expression (it does support hit counts, which VS Code classifies as a kind of conditional breakpoint). When editing a breakpoint, you're offered four options: * **Expression**. Set a conditional breakpoint (not currently supported). * **Hit Count**. Add a hit count to a breakpoint (supported). * **Log Message**. Add a logpoint (supported). * **Wait for Breakpoint**. Add a triggered breakpoint (supported). #### Set a hit count breakpoint A hit count breakpoint is a breakpoint that only breaks execution after the debugger hits it a specified number of times. To add a hit count breakpoint: 1. Right-click in the left gutter of the editor where you want to place the breakpoint, and select **Add Conditional Breakpoint**. 2. Select **Hit Count** from the menu and enter the desired hit count. To change an existing breakpoint to a hit count breakpoint: 1. Right-click on the breakpoint in the left gutter of the editor and select **Edit breakpoint**. 2. Select **Hit Count** from the menu and enter the desired hit count. You can also edit a breakpoint from the **Breakpoints** section of the **Run and Debug** view: * Right-click on the breakpoint and select **Edit Condition**, or, * Click the **Edit Condition** icon next to the breakpoint. This brings up the same menu next to the breakpoint in the editor tab. #### Enable error breakpoints You can enable and disable error breakpoints in VS Code by selecting "Mojo Raise" in the **Breakpoints** section of the **Run and Debug** view. If enabled during debugging, executing a `raise` statement causes the debugger to stop execution and highlight the line of code where the error was raised. ![VS Code window showing a program paused in the debugger with the Run and Debug view visible. The program is paused at a raise statement.](images/break-on-raise.png)
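To see error breakpoints in action, you can debug a small program that raises. This is a minimal sketch (the function and error message are made up), but any code path that executes `raise` behaves the same way:

```mojo
fn might_fail(x: Int) raises:
    if x < 0:
        # With "Mojo Raise" enabled, the debugger stops on this line.
        raise Error("negative input")

fn main() raises:
    might_fail(-1)
```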
### View local variables When a program is paused in the debugger, the editor shows local variable values inline. You can also find them in the **Variables** section of the **Run and Debug** view. ![VS Code window showing a program paused in the debugger, with the variables section of the Run and Debug view visible. The editor shows three functions (nested2, nested1, and main). The program is paused at a breakpoint in nested2.](images/debugger-variables.png) Figure 4. Local variable values displayed in the debugger ### View the call stack When a program is paused in the debugger, the **Run and Debug** view shows the current call stack. (You may see multiple call stacks, one for each active thread in the program.) ![VS Code window showing a program paused in the debugger, with the call stack and variables sections of the Run and Debug view visible. The call stack shows three functions (nested2, nested1, and main). The program is paused at a breakpoint in nested2; the parent function nested1 is selected in the call stack, and the editor highlights the current line in nested1 (the call to nested2()).](images/debugger-call-stack-nested1.png) Figure 5. Call stack in Run and Debug view The **Call Stack** section of the Run and Debug view shows a stack frame for each function call in the current call stack. Clicking on the name of the function highlights the current line in that function. For example, in Figure 5, the program is paused at a breakpoint in `nested2()`, but the parent function, `nested1()`, is selected in the call stack. The editor highlights the current line in `nested1()` (that is, the call to `nested2()`) and shows the current local variable values for `nested1()`. ### Use the Debug Console The **Debug Console** gives you a command-line interface to the debugger. The **Debug Console** processes LLDB commands and Mojo expressions. Anything prefixed with a colon (`:`) is treated as an LLDB command. Any other input is treated as an expression. Currently, Mojo expressions are limited to inspecting variables and their fields. The console also supports subscript notation (`vector[index]`) for certain data structures in the standard library, including `List` and `SIMD`. In the future, we intend to provide a way for arbitrary data structures to support subscript notation in the **Debug Console**. :::note The **Debug Console** only accepts input when the program is paused. ::: ## Tips and tricks There are several features in the standard library that aren't directly related to the debugger, but which can help you debug your programs. These include: * Programmatic breakpoints. * Setting parameters from the Mojo command line. ### Set a programmatic breakpoint To break at a specific point in your code, you can use the built-in [`breakpoint()`](/mojo/stdlib/builtin/breakpoint/breakpoint) function: ```mojo if some_value.is_valid(): do_the_right_thing() else: # We should never get here! breakpoint() ``` If you have VS Code open and run this code in debug mode (either using VS Code or `mojo debug`), hitting the `breakpoint()` call causes an error, which triggers the debugger. :::note Assertions The [`testing`](/mojo/stdlib/testing/testing/) module includes a number of ways to specify assertions. Assertions also trigger an error, so they can open the debugger in the same way that a `breakpoint()` call does. :::
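As a sketch of that pattern (the `validate()` function and its message are made up for illustration), a failing assertion raises an error that stops the debugger just like a `breakpoint()` call:

```mojo
from testing import assert_true

fn validate(x: Int) raises:
    # If this assertion fails, it raises an error, which opens
    # the debugger the same way a breakpoint() call does.
    assert_true(x > 0, "x must be positive")

fn main() raises:
    validate(-1)
```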
### Set parameters from the Mojo command line You can use the [`param_env`](/mojo/stdlib/sys/param_env/) module to retrieve parameter values specified on the Mojo command line. Among other things, this is an easy way to switch debugging logic on and off. For example: ```mojo from sys.param_env import is_defined def some_function_with_issues(): # ... @parameter if is_defined["DEBUG_ME"](): breakpoint() ``` To activate this code, use the [`-D` command-line option](/mojo/cli/debug#compilation-options) to define `DEBUG_ME`: ```bash mojo debug -D DEBUG_ME main.mojo ``` The `is_defined()` function returns a compile-time true or false value based on whether the specified name is defined. Since the `breakpoint()` call is inside a [parametric `if` statement](/mojo/manual/decorators/parameter#parametric-if-statement), it is only included in the compiled code when the `DEBUG_ME` name is defined on the command line. ## Troubleshooting ### `error: can't connect to the RPC debug server socket` If using `mojo debug --vscode` gives you the message `error: can't connect to the RPC debug server socket: Connection refused`, try the following possible fixes: * Make sure VS Code is open. * If VS Code is already open, try restarting VS Code. * If there are other VS Code windows open, try closing them and then restarting. This error can sometimes occur when multiple windows have opened and closed in certain orders. ### `error: couldn't get a valid response from the RPC server` If using `mojo debug --vscode` gives you the message `error: couldn't get a valid response from the RPC server`, try the following possible fixes: * Make sure VS Code is open to a valid Mojo codebase. This error can sometimes happen if the VS Code window is open to some other codebase. * If there are multiple VS Code windows open, try closing all but the one you wish to debug in. * Restart VS Code. * Reinstall the SDK and restart VS Code. * If you are working on a development version of the SDK, make sure that all SDK tools are properly built with your build system, and then reload VS Code. * As a last resort, restarting your entire computer can fix this problem. If these steps don't help, please file an issue. We'd love your help identifying possible causes and fixes! --- ## default_config_sm90 `default_config_sm90[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool, wgmma_shape: IndexList[3]]() -> MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape]` --- ## Defaultable The `Defaultable` trait describes a type with a default constructor. Implementing the `Defaultable` trait requires the type to define an `__init__` method with no arguments: ```mojo struct Foo(Defaultable): var s: String fn __init__(out self): self.s = "default" ``` You can now construct a generic `Defaultable` type: ```mojo fn default_init[T: Defaultable]() -> T: return T() var foo = default_init[Foo]() print(foo.s) ``` ```plaintext default ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self: _Self)` Create a default instance of the value.
--- ## Deploy a PyTorch model from Hugging Face import SmallCards from '@site/src/components/SmallCards'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import InstallModular from '@site/docs/_includes/install-modular.mdx'; import Requirements from '@site/src/components/Requirements'; import { requirementsWithGPU } from '@site/docs/max/requirements'; We designed MAX to simplify the entire AI development workflow, and that includes deploying PyTorch models with a high-performance serving endpoint. As we'll show you in this tutorial, deploying an endpoint with MAX is as simple as deploying a Docker container—you don't have to write any new code to use MAX. Currently, the MAX container includes a REST API that supports large language models (LLMs) only, so that's what we'll deploy. Specifically, we'll deploy the [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-1.5b-Instruct) model, but you can select a different PyTorch LLM from Hugging Face. (See our [README](https://github.com/modular/modular/tree/main/max) for a list of model architectures we currently support.) We've also included instructions to deploy to the cloud provider of your choice, either AWS, GCP, or Azure. :::caution MAX Serve preview This is an early look at our MAX container for PyTorch models. Currently, MAX Serve runs most PyTorch models using PyTorch eager execution, which uses GPUs for acceleration. In the future, MAX Serve will also accelerate PyTorch models using our MAX graph compiler. ::: If you want to instead deploy a highly-optimized LLM built with MAX, see [Deploy Llama 3 with MAX Serve on GPU](/max/tutorials/max-serve-local-to-cloud). System requirements: ## Deploy to a local endpoint In this section, you'll use MAX to serve the Qwen2.5 model on a local endpoint. 1. Set up your environment: 2. Start a local endpoint for the Qwen2.5 model: ```sh max serve --model-path=Qwen/Qwen2.5-1.5B-Instruct ``` In addition to starting a local server, this downloads the model weights and compiles the model, which might take some time. The endpoint is ready when you see the URI printed in your terminal: ```output Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit) ``` 3. Now open another terminal to send a request using `curl`: ```sh curl -N http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen2.5-1.5B-Instruct", "stream": true, "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of Mongolia?"} ] }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g' ``` You should see a response in your command line similar to the following: ```output The capital city of Mongolia is Ulaanbaatar. ``` That's it! In just a few steps, you've connected a Hugging Face LLM to an endpoint so it can receive and respond to inference requests.
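The streaming request above pipes the response through a few `sed` commands to extract the text. If you have `jq` installed, a non-streaming variant of the same request is a bit easier to read (a sketch; the prompt is arbitrary):

```sh
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of Mongolia?"}
    ]
  }' | jq -r '.choices[0].message.content'
```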
Now let's deploy the same thing to a GPU instance on the cloud. ## Deploy to a cloud provider In the first part of this tutorial, you used MAX to deploy a Hugging Face model to a local endpoint. In this next part, you'll use a prebuilt Docker container to deploy a model to a cloud provider. ### Prerequisites This tutorial shows you how to deploy a model to one of three cloud providers: - AWS - GCP - Azure To complete this tutorial, you should: - Be familiar with the basics of at least one of these cloud providers - Have the appropriate CLI tools installed: - [AWS CLI v2](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). - [Google Cloud SDK](https://cloud.google.com/sdk/docs/install). - [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli). - Have a project set up that you can use to deploy the Docker container. - Verify that you have access to the [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-1.5b-Instruct) model. - Enable any billing permissions so you can install the appropriate APIs and launch the designated GPU instances. ### Initialize CLI tools If you haven't already done so, make sure that you've initialized your CLI tools and logged in to your account. Configure the AWS CLI: ```bash aws configure ``` Log in to your AWS account: ```bash aws sso login ``` Check the credentials via `cat ~/.aws/credentials` to make sure they're set up correctly. You can also include the credentials as environment variables: ```bash export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID" export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY" ``` Initialize the Google Cloud SDK: ```bash gcloud init ``` Log in to your Google Cloud account: ```bash gcloud auth login ``` Initialize the Azure CLI: ```bash az init ``` Log in to your Azure account: ```bash az login ``` ### Create your deployment In this section, you'll go through the steps needed to create a deployment. These steps vary depending on the cloud provider you prefer to use. For AWS, we'll create an AWS CloudFormation template to define and configure our deployment. 1. Create a working directory for the Infrastructure as Code files. ```bash mkdir aws ``` Then, navigate to that directory. ```bash cd aws ``` 2. Set the AWS region. In this case, we'll use `us-east-1`, but you can use whatever region you prefer. ```bash export REGION="us-east-1" ``` 3. Create an AWS CloudFormation file, `max-serve-aws.yaml`. ```bash touch max-serve-aws.yaml ``` Then, using the editor of your choice, paste the following: max-serve-aws.yaml ```yaml AWSTemplateFormatVersion: '2010-09-09' Description: CloudFormation template to deploy MAX Serve on an EC2 instance. Parameters: InstanceType: Type: String Default: g5.4xlarge AllowedValues: - g5.4xlarge - p4d.24xlarge Description: EC2 instance type for the MAX Serve deployment. AmiId: Type: AWS::EC2::Image::Id Default: ami-02769e6d1f6a88067 Description: AMI ID for Deep Learning Base OSS Nvidia Driver AMI (Amazon Linux 2) in us-east-1. HuggingFaceHubToken: Type: String NoEcho: true Description: HuggingFace Hub API Token for accessing models. HuggingFaceRepoId: Type: String Default: Qwen/Qwen2.5-1.5b-instruct Description: Hugging Face Repository ID for the Model.
Resources: MaxServeInstanceProfile: Type: AWS::IAM::InstanceProfile Properties: Roles: - !Ref MaxServeInstanceRole MaxServeInstanceRole: Type: AWS::IAM::Role Properties: AssumeRolePolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Principal: Service: - ec2.amazonaws.com Action: - sts:AssumeRole ManagedPolicyArns: - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy Policies: - PolicyName: CloudWatchLogsAccess PolicyDocument: Version: '2012-10-17' Statement: - Effect: Allow Action: - logs:CreateLogStream - logs:PutLogEvents - logs:DescribeLogStreams Resource: !Sub 'arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/ec2/${AWS::StackName}-logs:*' MaxServeLogGroup: Type: AWS::Logs::LogGroup DeletionPolicy: Delete UpdateReplacePolicy: Delete Properties: LogGroupName: !Sub '/aws/ec2/${AWS::StackName}-logs' RetentionInDays: 1 MaxServeSecurityGroup: Type: AWS::EC2::SecurityGroup Properties: GroupDescription: Enable HTTP access on port 80 and SSH on port 22 SecurityGroupIngress: - IpProtocol: tcp FromPort: 80 ToPort: 80 CidrIp: 0.0.0.0/0 - IpProtocol: tcp FromPort: 22 ToPort: 22 CidrIp: 0.0.0.0/0 MaxServeInstance: Type: AWS::EC2::Instance Properties: InstanceType: !Ref InstanceType ImageId: !Ref AmiId SecurityGroupIds: - !Ref MaxServeSecurityGroup IamInstanceProfile: !Ref MaxServeInstanceProfile BlockDeviceMappings: - DeviceName: /dev/xvda Ebs: VolumeSize: 100 VolumeType: gp3 DeleteOnTermination: true UserData: 'Fn::Base64': !Sub | #!/bin/bash set -xe # Enable detailed logging # Redirect all output to a log file for debugging exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1 echo "Starting user data script execution..." # Install CloudWatch agent first echo "Installing CloudWatch agent..." sudo yum install -y amazon-cloudwatch-agent # Create log files and directory with proper permissions sudo mkdir -p /var/log/max-serve sudo touch /var/log/max-serve/container.log sudo chmod 644 /var/log/max-serve/container.log sudo chown root:root /var/log/max-serve/container.log # Configure CloudWatch agent early cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF' { "agent": { "metrics_collection_interval": 60, "run_as_user": "root" }, "logs": { "logs_collected": { "files": { "collect_list": [ { "file_path": "/var/log/messages", "log_group_name": "/aws/ec2/${AWS::StackName}-logs", "log_stream_name": "instance-logs", "timestamp_format": "%b %d %H:%M:%S", "timezone": "UTC" }, { "file_path": "/var/log/max-serve/container.log", "log_group_name": "/aws/ec2/${AWS::StackName}-logs", "log_stream_name": "instance-logs", "timestamp_format": "%Y-%m-%d %H:%M:%S", "timezone": "UTC" }, { "file_path": "/var/log/user-data.log", "log_group_name": "/aws/ec2/${AWS::StackName}-logs", "log_stream_name": "instance-logs", "timestamp_format": "%Y-%m-%d %H:%M:%S", "timezone": "UTC" } ] } }, "force_flush_interval": 15 } } EOF # Start the CloudWatch agent sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json -s sudo systemctl enable amazon-cloudwatch-agent sudo systemctl start amazon-cloudwatch-agent # Verify CloudWatch agent is running sudo systemctl status amazon-cloudwatch-agent # Continue with Docker installation and rest of the setup echo "Installing docker..."
sudo yum update -y sudo yum install -y docker aws-cfn-bootstrap sudo systemctl enable docker sudo systemctl start docker sudo usermod -a -G docker ec2-user # Verify docker is running echo "Checking docker status..." sudo systemctl status docker docker --version # Install NVIDIA Container Toolkit echo "Installing NVIDIA Container Toolkit..." distribution=$(. /etc/os-release;echo $ID$VERSION_ID) curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo sudo yum clean expire-cache sudo yum install -y nvidia-docker2 sudo systemctl restart docker # Verify NVIDIA docker installation echo "Checking NVIDIA docker installation..." nvidia-smi docker info | grep -i nvidia # Pull and run the MAX Serve container echo "Pulling and running MAX Serve container..." # Add error checking for docker pull if ! sudo docker pull docker.modular.com/modular/max-nvidia-full:latest; then echo "Failed to pull container image" /opt/aws/bin/cfn-signal -e 1 --stack ${AWS::StackName} --resource MaxServeInstance --region ${AWS::Region} exit 1 fi sudo docker images # Start the container and capture logs (publish the container's port 8000 on port 80, which the security group opens) CONTAINER_ID=$(sudo docker run -d \ --env "HF_TOKEN=${HuggingFaceHubToken}" \ --env "HF_HUB_ENABLE_HF_TRANSFER=1" \ -v /home/ec2-user/.cache/huggingface:/root/.cache/huggingface \ --gpus 1 \ -p 80:8000 \ --ipc=host \ docker.modular.com/modular/max-nvidia-full:latest \ --model-path ${HuggingFaceRepoId}) if [ $? -ne 0 ]; then echo "Failed to start container" /opt/aws/bin/cfn-signal -e 1 --stack ${AWS::StackName} --resource MaxServeInstance --region ${AWS::Region} exit 1 fi # Start following container logs in the background sudo docker logs -f $CONTAINER_ID > /var/log/max-serve/container.log 2>&1 & # Verify container is running echo "Checking container status..." if ! sudo docker ps | grep max-nvidia-full; then echo "Container is not running" /opt/aws/bin/cfn-signal -e 1 --stack ${AWS::StackName} --resource MaxServeInstance --region ${AWS::Region} exit 1 fi Outputs: InstanceId: Description: Instance ID of the EC2 instance Value: !Ref MaxServeInstance PublicDNS: Description: Public DNS of the EC2 instance Value: !GetAtt MaxServeInstance.PublicDnsName ``` 4. Create the stack. ```bash aws cloudformation create-stack --stack-name max-serve-stack \ --template-body file://max-serve-aws.yaml \ --parameters ParameterKey=InstanceType,ParameterValue=p4d.24xlarge \ ParameterKey=HuggingFaceHubToken,ParameterValue=<YOUR_HUGGING_FACE_HUB_TOKEN> \ ParameterKey=HuggingFaceRepoId,ParameterValue=Qwen/Qwen2.5-1.5b-instruct \ --capabilities CAPABILITY_IAM \ --region $REGION ``` Note that you must replace `<YOUR_HUGGING_FACE_HUB_TOKEN>` with your actual token. In addition, this command defines the model that we want to deploy. For this tutorial, we'll use the [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-1.5b-Instruct) model. This deployment can take a few minutes to complete. Track the status of the deployment by running the following command: ```bash aws cloudformation describe-stacks --stack-name max-serve-stack \ --region $REGION --query 'Stacks[0].StackStatus' --output text ``` When the CloudFormation stack is deployed, you should see a status of `CREATE_COMPLETE`. Type `q` to exit this prompt in your CLI. For GCP, we'll create a `.jinja` and `.yaml` file to define and configure our deployment. 1. Create a working directory for the Infrastructure as Code files. ```bash mkdir gcp ``` Then, navigate to that directory. ```bash cd gcp ``` 2.
Next, let's define a PROJECT_ID variable, which you'll use for some of the other commands you'll run later. ```bash PROJECT_ID="YOUR_PROJECT_ID" ``` Remember to replace `YOUR_PROJECT_ID` with the ID of your GCP project. 3. Enable the following APIs by running the following command: ```bash gcloud services enable deploymentmanager.googleapis.com --project=${PROJECT_ID} && \ gcloud services enable logging.googleapis.com --project=${PROJECT_ID} && \ gcloud services enable compute.googleapis.com --project=${PROJECT_ID} ``` 4. Create a file, `max-serve-gcp.jinja`. ```bash touch max-serve-gcp.jinja ``` Then, using the editor of your choice, paste in the following: max-serve-gcp.jinja ```yaml resources: # Main compute instance - name: {{ properties['instanceName'] }} type: compute.v1.instance properties: zone: {{ properties['zone'] }} machineType: zones/{{ properties['zone'] }}/machineTypes/{{ properties['machineType'] }} guestAccelerators: - acceleratorType: zones/{{ properties['zone'] }}/acceleratorTypes/{{ properties['acceleratorType'] }} acceleratorCount: {{ properties['acceleratorCount'] }} disks: - deviceName: boot boot: true autoDelete: true initializeParams: sourceImage: projects/deeplearning-platform-release/global/images/{{ properties['sourceImage'] }} diskSizeGb: 100 # Disk space in GB networkInterfaces: - network: global/networks/default accessConfigs: - name: External NAT type: ONE_TO_ONE_NAT serviceAccounts: - email: default scopes: - https://www.googleapis.com/auth/cloud-platform scheduling: preemptible: false onHostMaintenance: TERMINATE # Disables live migration for GPU instances automaticRestart: true metadata: items: - key: startup-script value: | #!/bin/bash set -xe # Enable detailed logging curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh sudo bash add-logging-agent-repo.sh # Update and install dependencies sudo apt-get update sudo apt-get install -y google-fluentd curl apt-transport-https ca-certificates gnupg lsb-release software-properties-common # Configure Stackdriver logging sudo service google-fluentd start sudo systemctl enable google-fluentd # Install the NVIDIA drivers if not installed if [ ! -f /opt/google/cuda-installer ]; then sudo /opt/deeplearning/install-driver.sh fi # Add Docker GPG key and Docker repository curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg echo \ "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \ $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null # Install Docker sudo apt-get update sudo apt-get install -y docker-ce docker-ce-cli containerd.io # Add NVIDIA Docker repository distribution=$(. 
/etc/os-release;echo $ID$VERSION_ID) \ && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \ && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list # Install NVIDIA container runtime sudo apt-get update sudo apt-get install -y nvidia-container-toolkit sudo systemctl restart docker # Add user to docker group sudo usermod -aG docker $(whoami) # Run the Docker container with GPU support docker run \ --env "HF_TOKEN={{ properties['huggingFaceHubToken'] }}" \ --env "HF_HUB_ENABLE_HF_TRANSFER=1" \ -v $HOME/.cache/huggingface:/root/.cache/huggingface \ --gpus 1 \ -p 8000:8000 \ --ipc host \ docker.modular.com/modular/max-nvidia-full:latest \ --model-path {{ properties['pytorch_model'] }} # Add firewall rule directly in template - name: allow-http-8000 type: compute.v1.firewall properties: network: global/networks/default sourceRanges: ["0.0.0.0/0"] targetTags: ["http-server"] allowed: - IPProtocol: tcp ports: ["8000"] # Outputs section to output public IP and instance details outputs: - name: instanceName value: $(ref.{{ properties['instanceName'] }}.name) description: Name of the GCP Compute instance. - name: instancePublicIP value: $(ref.{{ properties['instanceName'] }}.networkInterfaces[0].accessConfigs[0].natIP) description: Public IP address of the GCP Compute instance. ``` This file contains a couple of variables: - **huggingFaceHubToken**: Defines your Hugging Face hub token so you can access the appropriate model. - **pytorch_model**: Defines the PyTorch model that you want to deploy. We'll define those variables in the next section. 5. Your next step is to define the deployment. This deployment file defines a number of properties, in particular the model that we want to deploy. For this tutorial, we'll use the [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-1.5b-Instruct) model. In your working directory, create a file, `max-serve-gcp.yaml`. ```bash touch max-serve-gcp.yaml ``` Then, using the editor of your choice, paste in the following: max-serve-gcp.yaml ```yaml imports: - path: max-serve-gcp.jinja resources: - name: max-serve-deployment type: max-serve-gcp.jinja properties: instanceName: max-serve-instance zone: us-central1-b machineType: a2-highgpu-1g acceleratorType: nvidia-tesla-a100 acceleratorCount: 1 sourceImage: common-cu124-v20241118-ubuntu-2004-py310 huggingFaceHubToken: <YOUR_HUGGING_FACE_HUB_TOKEN> pytorch_model: Qwen/Qwen2.5-1.5B-Instruct ``` :::note Make sure you replace `<YOUR_HUGGING_FACE_HUB_TOKEN>` with your actual Hugging Face hub token. ::: 6. Create your deployment by running the following command: ```bash gcloud deployment-manager deployments create max-serve-deployment \ --config max-serve-gcp.yaml \ --project ${PROJECT_ID} ``` The deployment might take a few minutes to complete. To track the status of the deployment, run the following command: ```bash gcloud deployment-manager deployments describe max-serve-deployment \ --project=${PROJECT_ID} ``` 1. Create a working directory for the Infrastructure as Code files. ```bash mkdir azure ``` Then, navigate to that directory. ```bash cd azure ``` 2. Set the Azure region. In this case, we'll use `eastus`, but you can use whatever region you prefer. ```bash export REGION="eastus" ``` 3. Create the resource group.
```bash az group create --name maxServeResourceGroup --location $REGION ``` The following is the expected output: ```output { "id": "/subscriptions/SUBSCRIPTION_ID/resourceGroups/RESOURCE_GROUP_NAME", "location": "eastus", "managedBy": null, "name": "RESOURCE_GROUP_NAME", "properties": { "provisioningState": "Succeeded" }, "tags": null, "type": "Microsoft.Resources/resourceGroups" } ``` 4. Verify that the resource group was created successfully: ```bash az group show -n maxServeResourceGroup --query properties.provisioningState -o tsv ``` The following is the expected output: ```output Succeeded ``` 5. Create a file named `startup.sh` and paste in the following contents: startup.sh ```bash #!/bin/bash sudo usermod -aG docker $USER sudo systemctl restart docker sleep 10 sudo docker run \ --env "HF_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN>" \ -v $HOME/.cache/huggingface:/root/.cache/huggingface \ --gpus 1 \ -p 8000:8000 \ --ipc host \ docker.modular.com/modular/max-nvidia-full:latest \ --model-path Qwen/Qwen2.5-1.5B-Instruct ``` :::note Make sure to replace `<YOUR_HUGGING_FACE_HUB_TOKEN>` with your Hugging Face hub token. In addition, this uses the [Qwen2.5](https://huggingface.co/Qwen/Qwen2.5-1.5b-Instruct) model. However, you can later use any PyTorch LLM. ::: Then, encode the script using base64: ```bash base64 -i startup.sh | tr -d '\n' > encoded-script.txt ``` Use the output of this script for the placeholder `<ENCODED_SCRIPT>` in the next step. 6. Create a new file, `parameters.json`, and paste in the following contents. Be sure to replace `<ENCODED_SCRIPT>` with the encoded output from the previous step, and `<YOUR_SECURE_PASSWORD>` with your own secure password. parameters.json ```json { "adminUsername": { "value": "azureuser" }, "adminPassword": { "value": "<YOUR_SECURE_PASSWORD>" }, "vmSize": { "value": "Standard_NV36ads_A10_v5" }, "osDiskSizeGB": { "value": 128 }, "vnetAddressPrefix": { "value": "10.0.0.0/16" }, "subnetAddressPrefix": { "value": "10.0.0.0/24" }, "location": { "value": "[parameters('location')]" }, "startupScript": { "value": "<ENCODED_SCRIPT>" } } ``` 7. Create a new file, `max-serve-azure.json`, and paste in the following: max-serve-azure.json ```json { "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#", "contentVersion": "1.0.0.0", "parameters": { "adminUsername": { "type": "string", "metadata": { "description": "Admin username for the virtual machine." } }, "adminPassword": { "type": "securestring", "metadata": { "description": "Admin password for the virtual machine." } }, "vmSize": { "type": "string", "defaultValue": "standard_nc24ads_a100_v4", "metadata": { "description": "Size of the virtual machine." } }, "osDiskSizeGB": { "type": "int", "defaultValue": 128, "metadata": { "description": "OS disk size in GB." } }, "vnetAddressPrefix": { "type": "string", "defaultValue": "10.0.0.0/16", "metadata": { "description": "Address space for the virtual network." } }, "subnetAddressPrefix": { "type": "string", "defaultValue": "10.0.0.0/24", "metadata": { "description": "Subnet address space." } }, "startupScript": { "type": "string", "metadata": { "description": "Base64-encoded startup script." } }, "location": { "type": "string", "defaultValue": "westus3", "metadata": { "description": "Location for all resources."
} } }, "resources": [ { "type": "Microsoft.Network/virtualNetworks", "apiVersion": "2021-03-01", "name": "maxServeVNet", "location": "[parameters('location')]", "properties": { "addressSpace": { "addressPrefixes": [ "[parameters('vnetAddressPrefix')]" ] }, "subnets": [ { "name": "maxServeSubnet", "properties": { "addressPrefix": "[parameters('subnetAddressPrefix')]" } } ] } }, { "type": "Microsoft.Network/publicIPAddresses", "apiVersion": "2021-03-01", "name": "maxServePublicIP", "location": "[parameters('location')]", "properties": { "publicIPAllocationMethod": "Dynamic" } }, { "type": "Microsoft.Network/networkSecurityGroups", "apiVersion": "2021-02-01", "name": "maxServeNSG", "location": "[parameters('location')]", "properties": { "securityRules": [ { "name": "allowHTTP", "properties": { "priority": 100, "protocol": "Tcp", "access": "Allow", "direction": "Inbound", "sourceAddressPrefix": "*", "sourcePortRange": "*", "destinationAddressPrefix": "*", "destinationPortRange": "8000", "description": "Allow HTTP traffic on port 8000, where the MAX container listens" } }, { "name": "allowSSH", "properties": { "priority": 200, "protocol": "Tcp", "access": "Allow", "direction": "Inbound", "sourceAddressPrefix": "*", "sourcePortRange": "*", "destinationAddressPrefix": "*", "destinationPortRange": "22", "description": "Allow SSH traffic on port 22" } }, { "name": "allowOutbound", "properties": { "priority": 300, "protocol": "Tcp", "access": "Allow", "direction": "Outbound", "sourceAddressPrefix": "*", "sourcePortRange": "*", "destinationAddressPrefix": "*", "destinationPortRange": "*", "description": "Allow all outbound traffic" } } ] } }, { "type": "Microsoft.Network/networkInterfaces", "apiVersion": "2021-03-01", "name": "maxServeNIC", "location": "[parameters('location')]", "dependsOn": [ "[resourceId('Microsoft.Network/publicIPAddresses', 'maxServePublicIP')]", "[resourceId('Microsoft.Network/virtualNetworks', 'maxServeVNet')]", "[resourceId('Microsoft.Network/networkSecurityGroups', 'maxServeNSG')]" ], "properties": { "ipConfigurations": [ { "name": "ipconfig1", "properties": { "subnet": { "id": "[resourceId('Microsoft.Network/virtualNetworks/subnets', 'maxServeVNet', 'maxServeSubnet')]" }, "privateIPAllocationMethod": "Dynamic", "publicIPAddress": { "id": "[resourceId('Microsoft.Network/publicIPAddresses', 'maxServePublicIP')]" } } } ], "networkSecurityGroup": { "id": "[resourceId('Microsoft.Network/networkSecurityGroups', 'maxServeNSG')]" } } }, { "type": "Microsoft.Compute/virtualMachines", "apiVersion": "2021-03-01", "name": "maxServeVM", "location": "[parameters('location')]", "dependsOn": [ "[resourceId('Microsoft.Network/networkInterfaces', 'maxServeNIC')]" ], "plan": { "name": "nvaie_gpu_1_gen2", "publisher": "nvidia", "product": "nvidia-ai-enterprise" }, "properties": { "hardwareProfile": { "vmSize": "[parameters('vmSize')]" }, "osProfile": { "computerName": "maxServeVM", "adminUsername": "[parameters('adminUsername')]", "adminPassword": "[parameters('adminPassword')]" }, "storageProfile": { "imageReference": { "publisher": "nvidia", "offer": "nvidia-ai-enterprise", "sku": "nvaie_gpu_1_gen2", "version": "24.07.03" }, "osDisk": { "createOption": "FromImage", "managedDisk": { "storageAccountType": "Standard_LRS" }, "diskSizeGB": "[parameters('osDiskSizeGB')]" } }, "networkProfile": { "networkInterfaces": [ { "id": "[resourceId('Microsoft.Network/networkInterfaces', 'maxServeNIC')]" } ] } } }, { "type": "Microsoft.Compute/virtualMachines/extensions", "apiVersion": "2021-03-01", "name": "maxServeVM/customScriptExtension",
"location": "[resourceGroup().location]", "dependsOn": [ "[resourceId('Microsoft.Compute/virtualMachines', 'maxServeVM')]" ], "properties": { "publisher": "Microsoft.Azure.Extensions", "type": "CustomScript", "typeHandlerVersion": "2.1", "autoUpgradeMinorVersion": true, "settings": { "fileUris": [], "script": "[parameters('startupScript')]" } } } ], "outputs": { "vmName": { "type": "string", "value": "[reference('maxServeVM').osProfile.computerName]" } } } ``` 8. Create the deployment. ```bash az deployment group create \ --name maxServeDeployment \ --resource-group maxServeResourceGroup \ --template-file max-serve-azure.json \ --parameters @parameters.json location="$REGION" ``` 9. Track the status of the deployment by running the following command: ```bash az deployment group wait --name maxServeDeployment \ --resource-group maxServeResourceGroup \ --created ``` ### Retrieve instance information At this point, you should have confirmation that your instance is up and running! Let's get some of the information we need to test the deployment. Let's get the instance ID and public IP address and assign them to environment variables: ``` INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name max-serve-stack --query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" --output text --region $REGION) PUBLIC_IP=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID --query 'Reservations[0].Instances[0].PublicIpAddress' --output text --region $REGION) echo "Instance ID: $INSTANCE_ID" echo "Public IP: $PUBLIC_IP" aws ec2 wait instance-running --instance-ids $INSTANCE_ID --region $REGION ``` 1. Get the instance name and zone. Be sure to update the `INSTANCE_NAME` variable if you changed it from `max-serve-instance`. ```bash INSTANCE_NAME=max-serve-instance ZONE=$(gcloud compute instances list \ --filter="name:${INSTANCE_NAME}" \ --format="value(zone)") echo "Instance Name: $INSTANCE_NAME" echo "Zone: $ZONE" ``` 2. Add a tag to the instance. ```bash gcloud compute instances add-tags "${INSTANCE_NAME}" \ --project=${PROJECT_ID} \ --zone "${ZONE}" \ --tags http-server ``` 3. Retrieve the public IP address for the instance: ```bash PUBLIC_IP=$(gcloud compute instances describe "${INSTANCE_NAME}" \ --zone "${ZONE}" \ --format="get(networkInterfaces[0].accessConfigs[0].natIP)" \ --project=${PROJECT_ID}) echo "Public IP: $PUBLIC_IP" ``` Get the public IP address of our deployment. ```bash PUBLIC_IP=$(az network public-ip show \ --resource-group maxServeResourceGroup \ --name maxServePublicIP \ --query ipAddress -o tsv) ``` ### Test the endpoint We've confirmed that the instance is available. However, it can still take a few minutes to pull the MAX Docker image and start it. In this section, you'll learn how to check to see if the service is ready to receive inference requests, then run a `curl` command to send and receive a request to the container. To track when the instance is ready, you can use the AWS CloudWatch console to view the log group, `/aws/ec2/max-serve-stack-logs` and find the logs for `instance-logs`. 
Alternatively, you can use the following bash script: check-logs.sh ```bash REGION=$1 STACK_NAME=$2 MAX_WAIT_MINUTES=30 START_TIME=$(date +%s) LOG_GROUP="/aws/ec2/$STACK_NAME-logs" fetch_logs() { local stream_name=$1 local stream_type=$2 local limit=$3 echo "=== $stream_type Logs ===" if [ -n "$limit" ]; then aws logs get-log-events \ --log-group-name "$LOG_GROUP" \ --log-stream-name "$stream_name" \ --limit $limit \ --region $REGION \ --query 'events[*].[timestamp,message]' \ --output text else aws logs get-log-events \ --log-group-name "$LOG_GROUP" \ --log-stream-name "$stream_name" \ --start-time $(($(date +%s) - 60))000 \ --region $REGION \ --query 'events[*].[timestamp,message]' \ --output text fi echo "====================" } check_server_status() { local logs=$1 echo "🔍 Checking logs for server status..." # Check for Uvicorn startup message in container logs if echo "$logs" | grep -q "Server ready on http://0.0.0.0:8000" || echo "$logs" | grep -q "Application startup complete"; then echo "✅ Found server running message" return 0 fi echo "❌ Server running message not found" return 1 } echo "🔍 Starting monitoring for MAX server (max wait: ${MAX_WAIT_MINUTES} minutes)..." while true; do current_time=$(date +%s) elapsed_minutes=$(((current_time - START_TIME) / 60)) if [ $elapsed_minutes -ge $MAX_WAIT_MINUTES ]; then echo "❌ Timeout after ${MAX_WAIT_MINUTES} minutes. Server might still be starting up." exit 1 fi EC2_LOG_STREAM=$(aws logs describe-log-streams \ --log-group-name "$LOG_GROUP" \ --log-stream-name-prefix "instance-logs" \ --region $REGION \ --query "logStreams[0].logStreamName" \ --output text) echo "⏳ Checking logs... (${elapsed_minutes}/${MAX_WAIT_MINUTES} minutes)" if [ "$EC2_LOG_STREAM" != "None" ]; then echo "📜 Instance Logs:" EC2_LOGS=$(fetch_logs "$EC2_LOG_STREAM" "Instance" 50) if check_server_status "$EC2_LOGS"; then echo "✅ Server is ready! (took ${elapsed_minutes} minutes)" echo "📋 Latest logs:" echo "$EC2_LOGS" exit 0 fi else echo "⏳ Logs not yet available..." fi echo "⏳ Server still starting up... checking again in 60 seconds" echo "-------------------------------------------" sleep 60 done ``` The instance is ready when you can see a log entry similar to the following: ```output Server ready on http://0.0.0.0:8000 ``` After you see this log entry, you can test the endpoint by running the following `curl` command: ```bash curl -N http://$PUBLIC_IP/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen2.5-1.5b-instruct", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of Mongolia"} ] }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g' ``` 1. Assign the instance ID to an environment variable, `INSTANCE_ID`. ```bash INSTANCE_ID=$(gcloud compute instances describe ${INSTANCE_NAME} \ --zone=${ZONE} \ --project=${PROJECT_ID} \ --format="value(id)") ``` 2. Get the current logs by running the following command: ```bash gcloud logging read \ "resource.type=gce_instance AND \ resource.labels.instance_id=${INSTANCE_ID} AND \ jsonPayload.message:*" \ --project=${PROJECT_ID} \ --format="table(timestamp,jsonPayload.message)" \ --limit=10 ``` The instance is ready when you can see a log entry similar to the following: ```output uvicorn running on http://0.0.0.0:8000 ``` 3.
Test the endpoint by sending the following `curl` request: ```bash curl -N http://$PUBLIC_IP:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen2.5-1.5B-Instruct", "stream": true, "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of Mongolia"} ] }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g' ``` 1. Verify that the container is running. ```bash ssh azureuser@$PUBLIC_IP # Use the password that you set in your parameters.json file. sudo cat /var/log/azure/custom-script/handler.log sudo cat /var/lib/waagent/custom-script/download/0/stdout sudo cat /var/lib/waagent/custom-script/download/0/stderr ``` The instance is ready when you can see a log entry similar to the following: ```output uvicorn running on http://0.0.0.0:8000 ``` 2. Test the endpoint by sending the following `curl` request: ```bash curl -N http://$PUBLIC_IP:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen2.5-1.5B-Instruct", "stream": true, "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of Mongolia?"} ] }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g' ``` You should see a response in your command line similar to the following: ```output The capital city of Mongolia is Ulaanbaatar. ``` ### Delete the cloud resources Take a few minutes to explore your deployment. When you're finished, be sure to delete the resources created in this tutorial so you don't incur any unnecessary charges. 1. Delete the stack. ```bash aws cloudformation delete-stack --stack-name max-serve-stack \ --region $REGION ``` 2. Verify that the stack deleted successfully. ```bash aws cloudformation describe-stacks --stack-name max-serve-stack \ --region $REGION --query 'Stacks[0].StackStatus' --output text ``` ```bash gcloud deployment-manager deployments delete max-serve-deployment \ --project=${PROJECT_ID} ``` ```bash az group delete --name maxServeResourceGroup ``` ## Next steps In this tutorial, you've deployed a Hugging Face PyTorch model to the cloud using a MAX Docker container. Keep in mind that this is just a preview of MAX Serve for PyTorch models, and it's currently compatible with LLMs only. We're working on support for more models and more model optimizations with the MAX graph compiler. Here are some other topics to explore next: export const cards = [ { title: 'Deploy Llama 3 on GPU with MAX Serve', link: '/max/tutorials/max-serve-local-to-cloud', description: `Learn how to deploy Llama 3 on GPU with MAX Serve.`, }, { title: 'Benchmark MAX Serve on an NVIDIA H100 GPU', link: '/max/tutorials/benchmark-max-serve', description: `Learn how to use our benchmarking script to measure the performance of MAX Serve.`, }, { title: 'Bring your own fine-tuned model to MAX pipelines', link: '/max/tutorials/max-pipeline-bring-your-own-model', description: `Learn how to customize your own model in MAX pipelines.`, }, { title: 'Deploy Llama 3 on GPU-powered Kubernetes clusters', link: '/max/tutorials/deploy-max-serve-on-kubernetes', description: `Learn how to deploy Llama 3 using Kubernetes, MAX, and NVIDIA GPUs`, }, ]; To stay up to date with new releases, [sign up for our newsletter](https://www.modular.com/modverse#signup) and [join our community](https://www.modular.com/community).
And if you're interested in becoming a design partner to get early access and give us feedback, please [contact us](https://www.modular.com/company/contact). --- ## Deploy Llama 3 on GPU with MAX Serve import SmallCards from '@site/src/components/SmallCards'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import Requirements from '@site/src/components/Requirements'; import { requirementsWithGPU } from '@site/docs/max/requirements'; import InstallModular from '@site/docs/_includes/install-modular.mdx'; This guide walks through serving Llama 3 models with MAX Serve, from local testing to production deployment on major cloud platforms. You'll learn to automate the deployment process using Infrastructure-as-Code (IaC) and optimize performance with GPU resources. MAX Serve provides a streamlined way to deploy large language models (LLMs) with production-ready features like GPU acceleration, automatic scaling, and monitoring capabilities. Whether you're building a prototype or preparing for production deployment, this guide will help you set up a robust serving infrastructure for Llama 3. The tutorial is organized into the following sections: - **[Local setup](#local-setup)**: Run Llama 3 locally to verify its basic functionality. - **[Cloud deployment](#cloud-deployment)**: Deploy Llama 3 to AWS, GCP, or Azure using IaC templates and CLI commands. System requirements: ## Local setup In this section, you will set up and run Llama 3 locally to understand its capabilities and validate functionality before moving to the cloud. ### 1. Set up your environment Create a Python project to install our APIs and CLI tools. ### 2. Run Llama 3 locally Next, use the `max` CLI tool to interact with the Llama 3 model locally and ensure that the model runs as expected before deploying it in the cloud. 1. Export your Hugging Face token. To create a Hugging Face user access token, see [Access Tokens](https://huggingface.co/settings/tokens). ```bash export HF_TOKEN="<YOUR_HF_TOKEN>" ``` 2. Generate a response to a prompt with the following command: ```bash max generate --model-path=modularai/Llama-3.1-8B-Instruct-GGUF \ --prompt "What is the meaning of life?" \ --max-length 250 ``` :::note Available flags Use the `max generate --help` command to explore available flags such as `--devices`. Supported GPUs include NVIDIA H100, A100, A10G, L4, and L40. ::: 3. Start the model server using `max serve`. The `--model-path` flag specifies which model to load. ```bash max serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF ``` This starts a local server where you can test Llama 3's response generation capabilities. :::note GPU-enabled Docker containers We provide a pre-configured GPU-enabled Docker container that simplifies deployment. For more information, see [MAX container](/max/container). We'll use the MAX container later in the [cloud deployment](#cloud-deployment) steps. This container includes all necessary dependencies and configurations for running Llama 3 with GPU acceleration. ::: ### 3.
Test the local endpoint After starting the model server, you can test its functionality by sending a `curl` request from a new window: ```bash curl -N http://0.0.0.0:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "modularai/Llama-3.1-8B-Instruct-GGUF", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who won the World Series in 2020?"} ] }' | jq -r '.choices[].message.content' ``` After starting your server, you can go to [http://0.0.0.0:8000/docs](http://0.0.0.0:8000/docs) to learn more about available endpoints and API specifications. Now that the model works locally, we'll transition to cloud deployment. ## Cloud deployment paths {#cloud-deployment} We will use Infrastructure-as-Code (IaC) to create, configure, and deploy Llama 3 in the cloud. The cloud deployment instructions are divided by provider: AWS, GCP, and Azure. ### Cloud deployment overview For AWS, we will use CloudFormation; for GCP, Deployment Manager; and for Azure, Resource Manager. These IaC templates handle resource provisioning, networking, and security configuration. This approach simplifies deployments and ensures they are repeatable. The key steps are: - **Create and deploy stack/resources**: Use IaC templates for each cloud provider to deploy Llama 3. - **Test the endpoint**: Retrieve the public IP address after deployment and send a request to test the Llama 3 endpoint in the cloud. Each cloud-specific tab provides complete commands for setup, configuration, deployment, and testing. To better understand the flow of the deployment, here is a high-level overview of the architecture: Figure 1. Architecture diagram of the cloud stack for deploying MAX Serve. This architecture diagram illustrates the two-phase deployment setup for serving the Llama 3 model with MAX on cloud provider infrastructure. The deployment process is divided into two phases: * **Phase 1: Cloud stack creation**: In this initial phase, the following infrastructure is provisioned and configured to prepare for serving requests: * **Public IP assignment**: The cloud provider assigns a public IP to the virtual machine (VM), allowing it to be accessed externally. * **Firewall/Security group configuration**: Security settings, such as firewall rules or security groups, are applied to allow traffic on port 80, so the instance is reachable only over HTTP. * **GPU compute instance setup**: A GPU-enabled VM is created to handle model inference efficiently. This instance includes: * **GPU drivers/runtime installation**: Necessary GPU drivers and runtime libraries are installed to enable hardware acceleration for model processing. * **Docker container initialization**: A Docker container is launched on the VM, where it pulls the necessary images from the Docker Container Registry. This registry serves as a central repository for storing Docker images, making it easy to deploy and update the application. Inside the container, MAX Serve is set up alongside the Llama 3 model. This setup prepares the environment for serving requests but does not yet expose the endpoint to users. :::note GPU-enabled Docker containers The pre-configured GPU-enabled Docker container includes all necessary dependencies and configurations for running Llama 3 with GPU acceleration. The provided IaC templates initialize the MAX container.
If you don't use the provided templates for infrastructure setup, you can initialize the container image with the `docker run` command. For more information, see [MAX container](/max/container). ::: * **Phase 2: Serving the user endpoint**: Once the cloud stack is configured and the VM is set up, the deployment enters the second phase, where it starts serving user requests: * **HTTP endpoint exposure**: With the VM and Docker container ready, the system opens an OpenAI-compatible HTTP endpoint on port 80, allowing users to interact with the deployed Llama 3 model. * **Request handling by MAX Serve**: When a user sends an HTTP request to the public IP, MAX Serve processes the incoming request within the Docker container and forwards it to the Llama 3 model for inference. The model generates a response, which is then returned to the user via the endpoint. :::caution For the sake of this tutorial, we expose the public IP address of the VM to the internet. This is not recommended for direct use in production environments as it may expose your model to security risks. ::: ### Prerequisites Be sure that you have the following prerequisites, as well as appropriate access and permissions for the cloud provider of your choice. - **GPU resources**: You'll need access to GPU resources in your cloud account with the following specifications: - **Minimum GPU memory**: 24GB - **Supported GPU types**: NVIDIA H100, A100, A10G, L4, and L40 :::note This tutorial has been tested on `g5.4xlarge` (A10G 24GB) on AWS, `g2-standard-8` (L4 32GB) on GCP, and `Standard_NV36ads_A10_v5` (A10G 24GB) on Azure. ::: - **A Hugging Face user access token**: A valid Hugging Face token is required to access the model. To create a Hugging Face user access token, see [Access Tokens](https://huggingface.co/settings/tokens). You must make your token available in your environment with the following command: ```bash export HF_TOKEN="<YOUR_HF_TOKEN>" ``` - **Docker installation**: Install the [Docker Engine and CLI](https://docs.docker.com/engine/install/). We use a pre-configured GPU-enabled Docker container from our public repository. The container image (`docker.modular.com/modular/max-nvidia-full:latest`) is available on [Docker Hub](https://hub.docker.com/r/modular/max-nvidia-full). For more information, see [MAX container](/max/container). - **Cloud CLI tools**: Before deploying, ensure that you have the respective cloud provider CLI tools installed. - [AWS CLI v2](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) installed and configured with appropriate credentials - [Google Cloud SDK](https://cloud.google.com/sdk/docs/install) installed and initialized - [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli) installed, logged in, and configured Configure the AWS CLI: ```bash aws configure ``` Log in to your AWS account: ```bash aws sso login ``` Check the credentials via `cat ~/.aws/credentials` to make sure they're set up correctly. You can also include the credentials as environment variables: ```bash export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID" export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY" ``` Initialize the Google Cloud SDK: ```bash gcloud init ``` Log in to your Google Cloud account: ```bash gcloud auth login ``` Initialize the Azure CLI: ```bash az init ``` Log in to your Azure account: ```bash az login ``` ### 1. Create stack/deployment In this section, we'll walk through creating a deployment stack on AWS, GCP, and Azure.
Each cloud provider has its own configuration steps, detailed below, but we simplify the setup by using Infrastructure-as-Code (IaC) templates. Start by cloning the MAX repository and navigating to the `max/tutorials/max-serve-cloud-configs/` directory, where the necessary IaC templates and configuration files are organized for each cloud provider:

```bash
git clone -b stable https://github.com/modular/modular && cd modular/max/tutorials/max-serve-cloud-configs
```

This directory includes all files required to deploy the MAX Serve setup to AWS, GCP, or Azure:

```bash
max/tutorials/max-serve-cloud-configs/
├── aws
│   ├── max-serve-aws.yaml
│   └── notify.sh
├── azure
│   ├── max-serve-azure.json
│   └── notify.sh
└── gcp
    ├── max-serve-gcp.jinja
    └── notify.sh
```

With these IaC templates ready, choose your preferred cloud provider and follow the step-by-step instructions specific to each platform.

:::note Preparing the deployment takes some time
Stack creation may take some time to complete, and completion times differ across cloud providers.
:::

First, navigate to the AWS directory:

```bash
cd aws
```

Set the region in your environment:

```bash
export REGION="REGION" # example: `us-east-1`
```

Then, create the stack. You can explore the `max-serve-aws.yaml` file for AWS CloudFormation configuration information.

:::note Stack naming
The stack name must be **unique**, so be sure to change `--stack-name` if you create multiple stacks.
:::

```bash
export STACK_NAME="max-serve-stack"
aws cloudformation create-stack --stack-name ${STACK_NAME} \
  --template-body file://max-serve-aws.yaml \
  --parameters \
    ParameterKey=InstanceType,ParameterValue=g5.4xlarge \
    ParameterKey=HuggingFaceHubToken,ParameterValue=${HF_TOKEN} \
    ParameterKey=HuggingFaceRepoId,ParameterValue=modularai/Llama-3.1-8B-Instruct-GGUF \
  --capabilities CAPABILITY_IAM \
  --region $REGION
```

:::note GCP access requirements
You must have access to `deploymentmanager.googleapis.com`, `logging.googleapis.com`, and `compute.googleapis.com`, and be able to use `gcloud compute firewall-rules` to configure inbound traffic.
:::

First, navigate to the GCP directory:

```bash
cd gcp
```

Set the project ID and zone:

```bash
export PROJECT_ID="YOUR_PROJECT_ID"
export ZONE="ZONE" # example: `us-east1-d`
```

Enable the required APIs:

```bash
gcloud services enable deploymentmanager.googleapis.com --project=${PROJECT_ID} && \
gcloud services enable logging.googleapis.com --project=${PROJECT_ID} && \
gcloud services enable compute.googleapis.com --project=${PROJECT_ID}
```

Create the deployment with the following command. You can explore the `max-serve-gcp.jinja` file for more information on the Deployment Manager configuration.

:::note Deployment naming
The deployment name must be **unique**, so be sure to change `DEPLOYMENT_NAME` if you create multiple deployments.
:::

```bash
export DEPLOYMENT_NAME="max-serve-deployment"
export INSTANCE_NAME="max-serve-instance"
gcloud deployment-manager deployments create ${DEPLOYMENT_NAME} \
  --template max-serve-gcp.jinja \
  --properties "\
instanceName:${INSTANCE_NAME},\
zone:${ZONE},\
machineType:g2-standard-8,\
acceleratorType:nvidia-l4,\
acceleratorCount:1,\
sourceImage:common-cu123-v20240922-ubuntu-2204-py310,\
huggingFaceHubToken:${HF_TOKEN},\
huggingFaceRepoId:modularai/Llama-3.1-8B-Instruct-GGUF" \
  --project ${PROJECT_ID}
```

First, navigate to the Azure directory:

```bash
cd azure
```

Set the region:

```bash
export REGION="REGION" # example: `westus3`
```

Then, create the resource group:

:::note Resource group and deployment naming
If you receive an error about resource group location conflicts, it means the resource group already exists in a different location. You can either:

- Use a new resource group name
- Use the existing resource group's location

Additionally, the deployment name must be **unique**, so be sure to change `DEPLOYMENT_NAME` if you create multiple deployments.
:::

```bash
export RESOURCE_GROUP_NAME="maxServeResourceGroup"
export DEPLOYMENT_NAME="maxServeDeployment"
az group create --name ${RESOURCE_GROUP_NAME} --location ${REGION}
```

Check the status of the resource group:

```bash
az group show -n ${RESOURCE_GROUP_NAME} --query properties.provisioningState -o tsv
```

Create and encode the startup script:

```bash
STARTUP_SCRIPT='#!/bin/bash
sudo usermod -aG docker $USER
sudo systemctl restart docker
sleep 10
HF_TOKEN=$1
HUGGING_FACE_REPO_ID=${2:-modularai/Llama-3.1-8B-Instruct-GGUF}
sudo docker run -d \
  --restart unless-stopped \
  --env "HF_TOKEN=${HF_TOKEN}" \
  --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --gpus 1 \
  -p 80:8000 \
  --ipc=host \
  docker.modular.com/modular/max-nvidia-full:latest \
  --model-path ${HUGGING_FACE_REPO_ID}'
export STARTUP_SCRIPT=$(echo "$STARTUP_SCRIPT" | base64)
```

Then, create the deployment:

:::note NVIDIA license agreement
You may be required to accept the Azure Marketplace image terms for the NVIDIA AI Enterprise image:

```bash
az vm image terms accept --urn nvidia:nvidia-ai-enterprise:nvaie_gpu_1_gen2:latest
```
:::

:::caution Set an admin password
Replace `YOUR-SECURE-PASSWORD-123` with your own secure password so that you can `ssh` into the VM later in this tutorial.
:::

```bash
export VM_PASSWORD="YOUR-SECURE-PASSWORD-123"
az deployment group create \
  --name ${DEPLOYMENT_NAME} \
  --resource-group ${RESOURCE_GROUP_NAME} \
  --template-file max-serve-azure.json \
  --parameters \
    adminUsername="azureuser" \
    adminPassword=${VM_PASSWORD} \
    vmSize="Standard_NV36ads_A10_v5" \
    osDiskSizeGB=128 \
    vnetAddressPrefix="10.0.0.0/16" \
    subnetAddressPrefix="10.0.0.0/24" \
    startupScript="${STARTUP_SCRIPT}" \
    location="${REGION}"
```

### 2. Wait for resources to be ready

In this step, we'll wait for the resources to be ready. Stack and deployment creation may take some time to complete.

```bash
aws cloudformation wait stack-create-complete \
  --stack-name ${STACK_NAME} \
  --region ${REGION}
```

```bash
gcloud deployment-manager deployments describe ${DEPLOYMENT_NAME} \
  --project=${PROJECT_ID}
```

Wait for the deployment to complete and report its status:

```bash
az deployment group wait \
  --name ${DEPLOYMENT_NAME} \
  --resource-group ${RESOURCE_GROUP_NAME} \
  --created
```

### 3. Retrieve instance information
After the resources are deployed, you'll need to get the instance information, such as the public DNS name or IP address, which we will use to test the endpoint.

```bash
INSTANCE_ID=$(aws cloudformation describe-stacks --stack-name ${STACK_NAME} \
  --query "Stacks[0].Outputs[?OutputKey=='InstanceId'].OutputValue" \
  --output text \
  --region ${REGION})
PUBLIC_IP=$(aws ec2 describe-instances --instance-ids ${INSTANCE_ID} \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text \
  --region ${REGION})
echo "Instance ID: ${INSTANCE_ID}"
echo "Public IP: ${PUBLIC_IP}"
aws ec2 wait instance-running --instance-ids ${INSTANCE_ID} --region ${REGION}
```

First, check if the firewall rule already exists:

```bash
EXISTING_RULE=$(gcloud compute firewall-rules list \
  --filter="name=allow-http" \
  --format="value(name)" \
  --project=${PROJECT_ID})

if [ -z "$EXISTING_RULE" ]; then
  echo "Creating firewall rule..."
  gcloud compute firewall-rules create allow-http \
    --allow tcp:80 \
    --source-ranges 0.0.0.0/0 \
    --target-tags http-server \
    --description "Allow HTTP traffic on port 80" \
    --project=${PROJECT_ID}
else
  echo "Firewall rule 'allow-http' already exists"
fi
```

Check if the instance exists and tag it with `http-server`:

```bash
INSTANCE_EXISTS=$(gcloud compute instances list \
  --filter="name=${INSTANCE_NAME}" \
  --format="value(name)" \
  --project=${PROJECT_ID})

if [ -n "$INSTANCE_EXISTS" ]; then
  echo "Adding tags to instance ${INSTANCE_NAME}"
  gcloud compute instances add-tags "${INSTANCE_NAME}" \
    --project=${PROJECT_ID} \
    --zone "${ZONE}" \
    --tags http-server
else
  echo "Error: Instance ${INSTANCE_NAME} not found"
  exit 1
fi
```

Then, get the public IP:

```bash
PUBLIC_IP=$(gcloud compute instances describe "${INSTANCE_NAME}" \
  --zone "${ZONE}" \
  --format="get(networkInterfaces[0].accessConfigs[0].natIP)" \
  --project=${PROJECT_ID})
echo "Public IP: $PUBLIC_IP"
```

```bash
PUBLIC_IP=$(az network public-ip show \
  --resource-group ${RESOURCE_GROUP_NAME} \
  --name maxServePublicIP \
  --query ipAddress -o tsv)
echo "Public IP: ${PUBLIC_IP}"
```

### 4. Test the endpoint

:::note Wait until the server is ready to test the endpoint
It will take some time for the stack or deployment to pull the MAX Serve Docker image and set it up for serving. We need to wait for the Docker logs to appear and then make sure that the Docker container is running on port `8000`. The server is ready when you see the following log:

```output
Server ready on http://0.0.0.0:8000
```

We provide a simple script to monitor the startup progress and notify you when the server is ready.

For AWS, you can see the logs in the AWS CloudWatch UI within the log group `/aws/ec2/${STACK_NAME}-logs` and log stream `instance-logs`. Alternatively, you can use the provided bash script to monitor the logs until the server is ready:

```bash
bash notify.sh ${REGION} ${STACK_NAME} ${PUBLIC_IP}
```

For GCP, first make sure that the Docker container is running on port `8000`. You can view the logs in the Compute Engine VM instances UI: within the UI, choose **Observability**, then choose **Logs**. Alternatively, you can use the provided bash script to monitor the logs until the server is ready:

```bash
bash notify.sh ${PROJECT_ID} ${INSTANCE_NAME} ${ZONE} ${PUBLIC_IP}
```

For Azure, you can monitor the Docker container status (running on port `8000`) using one of the following methods:

#### Option 1: Use the monitoring script
1. Install the required dependencies for the monitoring script: install [sshpass](https://www.cyberciti.biz/faq/noninteractive-shell-script-ssh-password-provider/) on your local machine to enable automated SSH password authentication.

2. Set up and run the monitoring script:

```bash
bash notify.sh ${RESOURCE_GROUP_NAME} ${VM_PASSWORD} ${PUBLIC_IP}
```

#### Option 2: Manual SSH access

1. Connect to the VM:

```bash
ssh azureuser@$PUBLIC_IP
```

> **Note:** Use the password you set previously when creating the deployment.

2. View the startup logs:

```bash
sudo cat /var/log/azure/custom-script/handler.log
sudo cat /var/lib/waagent/custom-script/download/0/stdout
sudo cat /var/lib/waagent/custom-script/download/0/stderr
sudo docker logs $(sudo docker ps -q -f ancestor=docker.modular.com/modular/max-nvidia-full:latest)
```

Both methods help you confirm that the server is running correctly. The logs show the startup progress and any issues that need to be addressed.
:::

We will use the public IP address that we obtained in the previous step to test the endpoint with the following `curl` request:

:::tip
After the server starts, there may be a brief delay before the cloud provider exposes the public IP address. If you receive an error, wait approximately one minute and try again.
:::

```bash
curl -N http://$PUBLIC_IP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the World Series in 2020?"}
    ]
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
```

:::note Benchmarking MAX Serve
You can also use the public IP address of your deployed MAX Serve endpoint to benchmark the performance of Llama 3.1. MAX includes a benchmarking script that allows you to evaluate throughput, latency, and GPU utilization metrics. For more detailed instructions on benchmarking, see [Benchmark MAX Serve](https://github.com/modular/modular/tree/main/benchmark).
:::
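If you'd rather test from a script than with `curl`, the endpoint is OpenAI-compatible, so any OpenAI client library should work against it. Here's a minimal sketch using the official `openai` Python package (an illustration, not part of the tutorial's tooling; it assumes `pip install openai`, that `PUBLIC_IP` is exported as above, and uses a placeholder API key since the server doesn't validate one):

```python
import os

from openai import OpenAI

# MAX Serve speaks the OpenAI chat completions protocol on port 80,
# so the standard client works by pointing base_url at the VM.
client = OpenAI(
    base_url=f"http://{os.environ['PUBLIC_IP']}/v1",
    api_key="EMPTY",  # placeholder; the endpoint doesn't check keys
)

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the World Series in 2020?"},
    ],
)
print(response.choices[0].message.content)
```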
### 5. Delete the cloud resources

Cleaning up resources to avoid unwanted costs is critical. Use the following commands to safely terminate all of the resources you created in this tutorial.

First, delete the stack:

```bash
aws cloudformation delete-stack --stack-name ${STACK_NAME} --region ${REGION}
```

Wait for the stack to be deleted:

```bash
aws cloudformation wait stack-delete-complete \
  --stack-name ${STACK_NAME} \
  --region ${REGION}
```

```bash
gcloud deployment-manager deployments delete ${DEPLOYMENT_NAME} \
  --project=${PROJECT_ID}
```

```bash
az group delete --name ${RESOURCE_GROUP_NAME}
```

### Cost estimate

When deploying Llama 3 in a cloud environment, several cost factors come into play:

**Primary cost components:**

- **Compute resources**: GPU instances (such as AWS `g5.4xlarge`, GCP `g2-standard-8`, or Azure `Standard_NV36ads_A10_v5`) form the bulk of the costs
- **Network transfer**: Costs associated with data ingress/egress, which matter most for high-traffic applications
- **Storage**: Expenses for boot volumes and any additional storage requirements
- **Additional services**: Costs for logging, monitoring, and other supporting cloud services

For detailed cost estimates specific to your use case, we recommend using these official pricing calculators:

- [AWS Pricing Calculator](https://calculator.aws)
- [GCP Pricing Calculator](https://cloud.google.com/products/calculator)
- [Azure Pricing Calculator](https://azure.microsoft.com/en-us/pricing/calculator/)

:::tip Cloud cost optimization tips
- Consider using spot/preemptible instances for non-critical workloads
- Implement auto-scaling to match resource allocation with demand
- Monitor and optimize network usage patterns
- Set up cost alerts and budgets to avoid unexpected charges

Remember to factor in your expected usage patterns, regional pricing differences, and any applicable enterprise discounts when calculating total cost of ownership (TCO).
:::

## Next steps

Congratulations on successfully running MAX Pipelines locally and deploying Llama 3 to the cloud! 🎉

Now that you've mastered the essentials of setting up and deploying the Llama 3 model with MAX Serve, here are some other topics to explore next:

export const cards = [
  {
    title: 'Deploy a PyTorch model from Hugging Face',
    link: '/max/tutorials/deploy-pytorch-llm',
    description: `Learn how to deploy a PyTorch model to the cloud using MAX Serve.`,
  },
  {
    title: 'Benchmark MAX Serve on an NVIDIA H100 GPU',
    link: '/max/tutorials/benchmark-max-serve',
    description: `Learn how to use our benchmarking script to measure the performance of MAX Serve.`,
  },
  {
    title: 'Bring your own fine-tuned model to MAX pipelines',
    link: '/max/tutorials/max-pipeline-bring-your-own-model',
    description: `Learn how to customize your own model in MAX pipelines.`,
  },
  {
    title: 'Deploy Llama 3 on GPU-powered Kubernetes clusters',
    link: '/max/tutorials/deploy-max-serve-on-kubernetes',
    description: `Learn how to deploy Llama 3 using Kubernetes, MAX, and NVIDIA GPUs`,
  },
];

To stay up to date with new releases, [sign up for our newsletter](https://www.modular.com/modverse#signup) and [join our community](https://www.modular.com/community). And if you're interested in becoming a design partner to get early access and give us feedback, please [contact us](https://www.modular.com/company/contact).

---

## Deploy Llama 3 on GPU-powered Kubernetes clusters

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import Requirements from '@site/src/components/Requirements';
import { requirementsWithGPU } from '@site/docs/max/requirements';

MAX simplifies the process of deploying an LLM with high performance on GPUs.
And if you want to deploy at scale using Kubernetes' built-in monitoring, scaling, and cluster management, then you're in the right place.

In this tutorial, you'll learn how to deploy our MAX container on Kubernetes, using your pick of AWS, GCP, or Azure. You'll create a GPU-enabled Kubernetes cluster with your chosen cloud provider, then use Helm to deploy the MAX container, which provides an OpenAI-compatible endpoint for making inference requests.

:::note GPU required
When selecting your cloud environment, make sure it includes a [compatible GPU](/max/faq#gpu-requirements).
:::

## Set up your environment

Most of this tutorial involves interaction with your cloud service, so make sure you have the appropriate access and permissions. Most importantly, this tutorial uses GPU-powered Kubernetes clusters that may require special privileges.

1. To get started, select your cloud provider and install the corresponding required tools:

   To work with AWS, you'll need to install and configure two command-line tools. Begin by installing the AWS CLI using the [AWS CLI installation guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html), then install eksctl following the [eksctl installation guide](https://eksctl.io/installation/). After installation, authenticate your AWS account using:

   ```bash
   aws configure
   ```

   This will prompt you for your AWS credentials. For a complete setup walkthrough, refer to the [AWS authentication guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html) or the [Amazon EKS setup documentation](https://docs.aws.amazon.com/eks/latest/userguide/setting-up.html).

   Start by installing the Google Cloud CLI following the [GCP CLI installation guide](https://cloud.google.com/sdk/docs/install-sdk). Once installed, authenticate and configure your GCP environment:

   ```bash
   gcloud auth login
   gcloud config set project YOUR_PROJECT_ID
   gcloud config set compute/region us-central1
   ```

   For additional configuration options, see the [GCP authentication guide](https://cloud.google.com/docs/authentication/gcloud).

   Begin by installing the Azure CLI following the [Azure CLI installation guide](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli). Once installed, authenticate your Azure account:

   ```bash
   az login
   ```

   This command opens your default browser to complete the authentication process. For additional authentication methods, consult the [Azure authentication guide](https://learn.microsoft.com/en-us/cli/azure/authenticate-azure-cli).

2. **Install additional tools**:
   1. Install kubectl: Follow the [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/).
   2. Install Helm: Follow the [Helm installation guide](https://helm.sh/docs/intro/install/).

Now that you have the prerequisites out of the way, you can create a Kubernetes cluster with GPU nodes on your preferred cloud provider.

## Create a Kubernetes cluster with GPU nodes

To get started, you'll need a Kubernetes cluster equipped with GPU nodes to handle the compute demands of LLM inference. We recommend using _NVIDIA's A100_ instances for their high performance and efficiency in AI workloads.

{/** For more information on instances, see [Compatible cloud instances and virtual machines]().
**/}

Run the following command to create an EKS cluster with a single GPU node:

```bash
eksctl create cluster \
  --name max-cluster \
  --region us-east-1 \
  --node-type p4d.24xlarge \
  --nodes 1
```

For more information on `eksctl create cluster`, see [Create an Amazon EKS Cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html).

Run the following command to create a GKE cluster with a GPU node:

```bash
gcloud container clusters create max-cluster \
  --region us-central1 \
  --node-locations us-central1-a \
  --machine-type a2-highgpu-1g \
  --num-nodes 1 \
  --accelerator type=nvidia-tesla-a100,count=1
```

Then set up the required NVIDIA driver:

```bash
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```

For more information on `gcloud container clusters create`, see [Creating a zonal cluster](https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-zonal-cluster).

First, run the following command to create a resource group in your chosen region:

```bash
az group create --name my-resource-group --location eastus
```

Then, run the following command to create the AKS cluster:

```bash
az aks create \
  --resource-group my-resource-group \
  --name max-cluster \
  --node-count 1 \
  --generate-ssh-keys \
  --node-vm-size "standard_nc24ads_a100_v4"
```

After the cluster is created, configure your local environment to connect to it by retrieving the cluster credentials:

```bash
az aks get-credentials --resource-group my-resource-group --name max-cluster
```

For more information on `az aks create`, see [Deploy an AKS cluster using Azure CLI](https://learn.microsoft.com/en-us/azure/aks/learn/quick-kubernetes-deploy-cli).
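Whichever provider you chose, it's worth confirming that `kubectl` now points at the new cluster and that a GPU node has registered before moving on. A quick check (the `nvidia.com/gpu` capacity only appears once the NVIDIA device plugin or driver installer is running):

```bash
# Confirm the GPU node joined the cluster and is Ready
kubectl get nodes -o wide

# Check that a node advertises GPU capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"
```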
## Set up a Kubernetes namespace

Next, we'll create a dedicated namespace:

```bash
kubectl create namespace max-openai-api-demo
```

{/** Then, set your Hugging Face token:

```bash
kubectl create secret generic huggingface-secret \
  --from-literal=HF_TOKEN=your_token_here \
  --namespace max-openai-api-demo
```
*/}

Then set this namespace as our default:

```bash
kubectl config set-context --current --namespace=max-openai-api-demo
```

## Deploy using Helm

Now we'll deploy the Llama 3.1 model graph with MAX using Helm (the command is the same on AWS and Azure):

```bash
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
  --version 25.1.0 \
  --namespace max-openai-api-demo \
  --set huggingfaceRepoId=modularai/Llama-3.1-8B-Instruct-GGUF \
  --set maxServe.maxLength=512 \
  --set maxServe.maxBatchSize=16 \
  --set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
  --timeout 15m0s \
  --wait
```

On GKE, also set explicit GPU resource requests and limits:

```bash
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
  --version 25.1.0 \
  --namespace max-openai-api-demo \
  --set huggingfaceRepoId=modularai/Llama-3.1-8B-Instruct-GGUF \
  --set maxServe.maxLength=512 \
  --set maxServe.maxBatchSize=16 \
  --set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
  --set "resources.limits.nvidia\\.com/gpu=1" \
  --set "resources.requests.nvidia\\.com/gpu=1" \
  --timeout 15m0s \
  --wait
```

**Resolve "error getting credentials"**

If you encounter this error when running the Helm install command:

```output
Error: INSTALLATION FAILED: error getting credentials - err: exec: "docker-credential-desktop": executable file not found in $PATH, out: ``
```

This occurs because Helm is trying to use Docker Desktop's credential helper, but it's not available in your PATH. To resolve this, configure Docker to use the basic credential store:

```bash
docker login
```

Then retry your Helm install command.

When you run this command, Helm begins a multi-stage deployment process. First, it pulls the MAX container image from Docker Hub. Next, it downloads the Llama 3.1 GGUF model weights. Finally, it configures and launches the model as an endpoint, making it accessible on port `8000`. You'll need to set up port forwarding to access this endpoint.

:::note
Use `--set envSecret.HF_TOKEN=` if your model is a gated model and requires a Hugging Face token.
:::

## Verify and test the deployment

After deploying, follow these steps to verify and test your deployment:

1. Watch the pod status to ensure it's running:

   ```bash
   kubectl get pods -w
   ```

2. Check the logs for any startup issues:

   ```bash
   kubectl logs -f POD_NAME
   ```

3. Set up port forwarding to access the service locally:

   1. Get the name of your MAX pod:

      ```bash
      POD_NAME=$(kubectl get pods -l "app.kubernetes.io/name=max-openai-api-chart,app.kubernetes.io/instance=max-openai-api" -o jsonpath="{.items[0].metadata.name}")
      ```

   2. Retrieve the container port that MAX is listening on:

      ```bash
      CONTAINER_PORT=$(kubectl get pod $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
      ```
   3. Make the MAX endpoint accessible on `localhost:8000`:

      ```bash
      kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT &
      ```

## Send an inference request

Now that your deployment is verified and port forwarding is set up, you can test the model by sending it a chat request. You will use [OpenAI's chat completion](https://platform.openai.com/docs/guides/text-generation) endpoint to send the request. Open a new tab in your terminal and run the following command:

```bash
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'
```

The following is the expected output:

```output
The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
```

## Monitoring

Once deployed, you can monitor your deployment's health and performance. The following optional commands help you monitor your deployment:

- Check pod logs:

  ```bash
  kubectl logs -f $POD_NAME
  ```

- Monitor node resources:

  ```bash
  kubectl top nodes
  ```

- Monitor pod resources:

  ```bash
  kubectl top pods
  ```

- Monitor GPU utilization:

  ```bash
  kubectl exec -it $(kubectl get pods --namespace max-openai-api-demo -l app.kubernetes.io/name=max-openai-api-chart -o jsonpath='{.items[0].metadata.name}') -- nvidia-smi
  ```

For more information on benchmarking and additional performance metrics, see [Benchmark MAX performance](https://github.com/modular/modular/tree/main/benchmark).

## Cleanup

When you're done testing or need to tear down the environment:

1. Uninstall the Helm release:

   ```bash
   helm uninstall max-openai-api --namespace max-openai-api-demo
   ```

2. Delete the Kubernetes namespace:

   ```bash
   kubectl delete namespace max-openai-api-demo
   ```

3. Delete your Kubernetes cluster:

   The following command deletes an Amazon EKS cluster and all associated resources in a specified region:

   ```bash
   eksctl delete cluster --name max-cluster --region us-east-1
   ```

   For more information on `eksctl delete cluster`, see [Delete a cluster](https://docs.aws.amazon.com/eks/latest/userguide/delete-cluster.html).

   The following command deletes a GKE cluster and its associated resources in the region where it was created:

   ```bash
   gcloud container clusters delete max-cluster --region us-central1
   ```

   For more information on `gcloud container clusters delete`, see [Deleting a cluster](https://cloud.google.com/kubernetes-engine/docs/how-to/deleting-a-cluster).

   The following command deletes an AKS cluster and its associated resources in a specified resource group:

   ```bash
   az aks delete --resource-group my-resource-group --name max-cluster
   ```

   For more information on `az aks delete`, see [Delete an Azure Kubernetes Service cluster](https://learn.microsoft.com/en-us/azure/aks/delete-cluster).

## Next steps

You now have a GPU-powered MAX deployment running in the cloud, ready to handle LLM inference at scale with features like optimized GPU utilization, automatic scaling, and robust monitoring. Be sure to monitor performance and costs, and tailor configurations to your specific workload needs.

Keep in mind that this is just a preview of MAX on NVIDIA GPUs. We're working hard to add support for more hardware, including AMD GPUs, and to optimize performance for more GenAI models.
To stay up to date with new releases, [sign up for our newsletter](https://www.modular.com/modverse#signup), [check out the community](https://www.modular.com/community), and [join our forum](https://forum.modular.com/). And if you're interested in becoming a design partner to get early access and give us feedback, please [contact us](https://www.modular.com/company/contact).

---

## Deploying

import MDXListing from '@site/src/components/Listing/MDXListing';
import TutorialStack from '@site/src/components/TutorialStack';

Our Kubernetes-ready Docker container simplifies the process of deploying a GenAI model to the cloud with your own endpoint. We also offer step-by-step tutorials to deploy your endpoint with services such as AWS, GCP, and Azure.

## Guides

export const docs = [
  '../container',
  '../../mammoth/index',
]

## Tutorials

export const tutorials = [
  'max-serve-local-to-cloud',
  'deploy-max-serve-on-kubernetes',
  'deploy-serverless-cloud-run',
];

---

## depth

`depth(src: IntTuple[origin]) -> Int`

Calculates the maximum nesting depth of an `IntTuple`.

This function recursively traverses the `IntTuple` structure to determine its maximum nesting depth. A scalar value has depth 0, a flat tuple has depth 1, and nested tuples increase the depth accordingly.

Example:

```mojo
from layout import IntTuple, depth

print(depth(IntTuple(1)))               # prints 0
print(depth(IntTuple(1, 2)))            # prints 1
print(depth(IntTuple(IntTuple(1, 2))))  # prints 2
```

**Args:**

* src (`IntTuple[origin]`): The `IntTuple` to measure the depth of.

**Returns:**

An integer representing the maximum nesting depth.

---

## deque

Defines the Deque type. You can import these APIs from the `collections` package.

Examples:

```mojo
from collections import Deque
```

## Structs

* [​`Deque`](/mojo/stdlib/collections/deque/Deque): Implements a double-ended queue.

---

## Deque

`struct Deque[ElementType: Copyable & Movable]`

Implements a double-ended queue. It supports pushing and popping from both ends in O(1) time, resizing the underlying storage as needed.

## Parameters

* ElementType (`Copyable & Movable`): The type of the elements in the deque. Must implement the traits `Copyable` and `Movable`.

## Implemented traits

`AnyType`, `Boolable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Aliases

### `default_capacity`

`alias default_capacity = 64`

The default capacity of the deque; must be a power of 2.

## Methods

### `__init__`

`__init__(out self, *, owned elements: Optional[List[ElementType]] = Optional(None), capacity: Int = 64, min_capacity: Int = 64, maxlen: Int = -1, shrink: Bool = True)`

Constructs a deque.

**Args:**

* elements (`Optional[List[ElementType]]`): The optional list of initial deque elements.
* capacity (`Int`): The initial capacity of the deque.
* min\_capacity (`Int`): The minimum allowed capacity of the deque when shrinking.
* maxlen (`Int`): The maximum allowed capacity of the deque when growing.
* shrink (`Bool`): Whether storage should be de-allocated when not needed.

`__init__(out self, owned *values: ElementType)`

Constructs a deque from the given values.

**Args:**

* \*values (`ElementType`): The values to populate the deque with.

`__init__(out self, *, owned elements: VariadicListMem[ElementType, origin, is_owned])`

Constructs a deque from the given values.

**Args:**

* elements (`VariadicListMem[ElementType, origin, is_owned]`): The values to populate the deque with.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Moves data of an existing deque into a new one.
**Args:**

* existing (`Self`): The existing deque.

### `__del__`

`__del__(owned self)`

Destroys all elements in the deque and frees its memory.

### `__bool__`

`__bool__(self) -> Bool`

Checks whether the deque has any elements or not.

**Returns:**

`False` if the deque is empty, `True` if there is at least one element.

### `__getitem__`

`__getitem__(ref self, idx: Int) -> ref [self] ElementType`

Gets the deque element at the given index.

**Args:**

* idx (`Int`): The index of the element.

**Returns:**

A reference to the element at the given index.

### `__eq__`

`__eq__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], other: Deque[T]) -> Bool`

Checks if two deques are equal.

**Parameters:**

* T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`.

**Args:**

* other (`Deque[T]`): The deque to compare with.

**Returns:**

`True` if the deques are equal, `False` otherwise.

### `__ne__`

`__ne__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], other: Deque[T]) -> Bool`

Checks if two deques are not equal.

**Parameters:**

* T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`.

**Args:**

* other (`Deque[T]`): The deque to compare with.

**Returns:**

`True` if the deques are not equal, `False` otherwise.

### `__contains__`

`__contains__[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T) -> Bool`

Verifies whether a given value is present in the deque.

**Parameters:**

* T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`.

**Args:**

* value (`T`): The value to find.

**Returns:**

`True` if the value is contained in the deque, `False` otherwise.

### `__add__`

`__add__(self, other: Self) -> Self`

Concatenates self with other and returns the result as a new deque.

**Args:**

* other (`Self`): Deque whose elements will be appended to the elements of self.

**Returns:**

The newly created deque with the properties of `self`.

### `__mul__`

`__mul__(self, n: Int) -> Self`

Concatenates `n` copies of `self` and returns a new deque.

**Args:**

* n (`Int`): The multiplier number.

**Returns:**

The new deque.

### `__iadd__`

`__iadd__(mut self, other: Self)`

Appends the elements of the other deque into self.

**Args:**

* other (`Self`): Deque whose elements will be appended to self.

### `__imul__`

`__imul__(mut self, n: Int)`

Concatenates self `n` times in place.

**Args:**

* n (`Int`): The multiplier number.

### `copy`

`copy(self) -> Self`

Creates a deep copy of the given deque.

**Returns:**

A copy of the value.

### `__iter__`

`__iter__(ref self) -> _DequeIter[ElementType, self_is_origin]`

Iterates over elements of the deque, returning the references.

**Returns:**

An iterator of the references to the deque elements.

### `__reversed__`

`__reversed__(ref self) -> _DequeIter[ElementType, self_is_origin, False]`

Iterates backwards over the deque, returning the references.

**Returns:**

A reversed iterator of the references to the deque elements.

### `__len__`

`__len__(self) -> Int`

Gets the number of elements in the deque.

**Returns:**

The number of elements in the deque.

### `write_to`

`write_to[T: Representable & Copyable & Movable, WriterType: Writer](self: Deque[T], mut writer: WriterType)`

Writes `my_deque.__str__()` to a `Writer`.

**Parameters:**

* T (`Representable & Copyable & Movable`): The type of the Deque elements.
Must implement the trait `Representable`.
* WriterType (`Writer`): A type conforming to the `Writer` trait.

**Args:**

* writer (`WriterType`): The object to write to.

### `__str__`

`__str__[T: Representable & Copyable & Movable, //](self: Deque[T]) -> String`

Returns a string representation of a `Deque`.

Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example:

```mojo
my_deque = Deque[Int](1, 2, 3)
print(my_deque.__str__())
```

When the compiler supports conditional methods, a simple `String(my_deque)` will be enough.

**Parameters:**

* T (`Representable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `Representable`.

**Returns:**

A string representation of the deque.

### `__repr__`

`__repr__[T: Representable & Copyable & Movable, //](self: Deque[T]) -> String`

Returns a string representation of a `Deque`.

Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example:

```mojo
my_deque = Deque[Int](1, 2, 3)
print(my_deque.__repr__())
```

When the compiler supports conditional methods, a simple `repr(my_deque)` will be enough.

**Parameters:**

* T (`Representable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `Representable`.

**Returns:**

A string representation of the deque.

### `append`

`append(mut self, owned value: ElementType)`

Appends a value to the right side of the deque.

**Args:**

* value (`ElementType`): The value to append.

### `appendleft`

`appendleft(mut self, owned value: ElementType)`

Appends a value to the left side of the deque.

**Args:**

* value (`ElementType`): The value to append.

### `clear`

`clear(mut self)`

Removes all elements from the deque, leaving it with length 0. Resets the underlying storage capacity to `_min_capacity`.

### `count`

`count[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T) -> Int`

Counts the number of occurrences of a `value` in the deque.

**Parameters:**

* T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the trait `EqualityComparable`.

**Args:**

* value (`T`): The value to count.

**Returns:**

The number of occurrences of the value in the deque.

### `extend`

`extend(mut self, owned values: List[ElementType])`

Extends the right side of the deque by consuming elements of the list argument.

**Args:**

* values (`List[ElementType]`): List whose elements will be added at the right side of the deque.

### `extendleft`

`extendleft(mut self, owned values: List[ElementType])`

Extends the left side of the deque by consuming elements from the list argument. Acts as a series of left appends, resulting in reversed order of the elements from the list argument.

**Args:**

* values (`List[ElementType]`): List whose elements will be added at the left side of the deque.

### `index`

`index[T: EqualityComparable & Copyable & Movable, //](self: Deque[T], value: T, start: Int = 0, stop: Optional[Int] = Optional(None)) -> Int`

Returns the index of the first occurrence of a `value` in the deque, restricted to the range given by the `start` and `stop` bounds.

**Parameters:**

* T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the `EqualityComparable` trait.

**Args:**

* value (`T`): The value to search for.
* start (`Int`): The starting index of the search, treated as a slice index (defaults to 0).
* ​stop (`Optional[Int]`): The ending index of the search, treated as a slice index (defaults to None, which means the end of the deque). **Returns:** The index of the first occurrence of the value in the deque. **Raises:** ValueError: If the value is not found in the deque. ### `insert` `insert(mut self, idx: Int, owned value: ElementType)` Inserts the `value` into the deque at position `idx`. **Args:** * ​idx (`Int`): The position to insert the value into. * ​value (`ElementType`): The value to insert. **Raises:** IndexError: If deque is already at its maximum size. ### `remove` `remove[T: EqualityComparable & Copyable & Movable, //](mut self: Deque[T], value: T)` Removes the first occurrence of the `value`. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the deque. Must implement the `EqualityComparable` trait. **Args:** * ​value (`T`): The value to remove. **Raises:** ValueError: If the value is not found in the deque. ### `peek` `peek(self) -> ElementType` Inspect the last (rightmost) element of the deque without removing it. **Returns:** The last (rightmost) element of the deque. **Raises:** IndexError: If the deque is empty. ### `peekleft` `peekleft(self) -> ElementType` Inspect the first (leftmost) element of the deque without removing it. **Returns:** The first (leftmost) element of the deque. **Raises:** IndexError: If the deque is empty. ### `pop` `pop(mut self) -> ElementType` Removes and returns the element from the right side of the deque. **Returns:** The popped value. **Raises:** IndexError: If the deque is empty. ### `popleft` `popleft(mut self) -> ElementType` Removes and returns the element from the left side of the deque. **Returns:** The popped value. **Raises:** IndexError: If the deque is empty. ### `reverse` `reverse(mut self)` Reverses the elements of the deque in-place. ### `rotate` `rotate(mut self, n: Int = 1)` Rotates the deque by `n` steps. If `n` is positive, rotates to the right. If `n` is negative, rotates to the left. **Args:** * ​n (`Int`): Number of steps to rotate the deque (defaults to 1). --- ## Developing import MDXListing from '@site/src/components/Listing/MDXListing'; import TutorialStack from '@site/src/components/TutorialStack'; We built the Modular Platform from the ground up to simplify AI development for production and get the most out of your GPUs. Although it's not a machine learning framework, Modular provides programmability at every layer of the stack. You can build graphs in Python and write custom ops with hardware-agnostic GPU kernels in Mojo. None of it uses CUDA or other vendor-specific frameworks. ## Guides export const docs = [ '../custom-ops/index', '../graph/quantize', ] ## Tutorials export const tutorials = [ 'max-pipeline-bring-your-own-model', 'build-custom-ops', 'get-started-with-max-graph-in-python', ]; export const mojoTutorials = [ 'gpu/intro-tutorial', ]; --- ## Device `struct Device` ## Fields * ​idx (`Int`): * ​device (`_DeviceImpl`): ## Implemented traits `AnyType`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, idx: Int = 0)` ### `__copyinit__` `__copyinit__(out self, existing: Self)` ### `get_driver_version` `get_driver_version(self) -> DriverVersion` Returns NVIDIA driver version. 
### `max_mem_clock`

`max_mem_clock(self) -> Int`

### `max_graphics_clock`

`max_graphics_clock(self) -> Int`

### `mem_clocks`

`mem_clocks(self) -> List[Int, True]`

### `graphics_clocks`

`graphics_clocks(self, memory_clock_mhz: Int) -> List[Int, True]`

### `set_clock`

`set_clock(self, mem_clock: Int, graphics_clock: Int)`

### `gpu_turbo_enabled`

`gpu_turbo_enabled(self) -> Bool`

Returns `True` if GPU turbo is enabled.

### `set_gpu_turbo`

`set_gpu_turbo(self, enabled: Bool = True)`

Sets the GPU turbo state.

### `get_persistence_mode`

`get_persistence_mode(self) -> Bool`

Returns `True` if GPU persistence mode is enabled.

### `set_persistence_mode`

`set_persistence_mode(self, enabled: Bool = True)`

Sets the persistence mode.

### `set_max_gpu_clocks`

`set_max_gpu_clocks(device)`

### `__str__`

`__str__(self) -> String`

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

### `__repr__`

`__repr__(self) -> String`

---

## device_attribute

This module defines GPU device attributes that can be queried from CUDA-compatible devices.

The module provides the `DeviceAttribute` struct, which encapsulates the various device properties and capabilities that can be queried through the CUDA driver API. Each attribute is represented as a constant with a corresponding integer value that maps to the CUDA driver's attribute enumeration.

These attributes allow applications to query specific hardware capabilities and limitations of GPU devices, such as maximum thread counts, memory sizes, compute capabilities, and supported features.

## Structs

* [​`DeviceAttribute`](/mojo/stdlib/gpu/host/device_attribute/DeviceAttribute): Represents CUDA device attributes that can be queried from a GPU device.

---

## device_context

This module provides functionality for interacting with accelerators. In particular the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct, which represents a single stream of execution on a given accelerator. You can use this struct to allocate accelerator memory, copy data to and from the accelerator, and compile and execute functions on the accelerator.

## Structs

* [​`DeviceBuffer`](/mojo/stdlib/gpu/host/device_context/DeviceBuffer): Represents a block of device-resident storage. For GPU devices, a device buffer is allocated in the device's global memory.
* [​`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext): Represents a single stream of execution on a particular accelerator (GPU).
* [​`DeviceExternalFunction`](/mojo/stdlib/gpu/host/device_context/DeviceExternalFunction): Represents an external device function loaded from PTX/SASS assembly.
* [​`DeviceFunction`](/mojo/stdlib/gpu/host/device_context/DeviceFunction): Represents a compiled device function for GPU execution.
* [​`DeviceMulticastBuffer`](/mojo/stdlib/gpu/host/device_context/DeviceMulticastBuffer): Represents a multicast memory object that enables special memory operations to be broadcast across a group of devices.
* [​`DeviceStream`](/mojo/stdlib/gpu/host/device_context/DeviceStream): Represents a CUDA/HIP stream for asynchronous GPU operations.
* [​`HostBuffer`](/mojo/stdlib/gpu/host/device_context/HostBuffer): Represents a block of host-resident storage. For GPU devices, a host buffer is allocated in the host's global memory.

---

## device_passable

## Traits

* [​`DevicePassable`](/mojo/stdlib/builtin/device_passable/DevicePassable): This trait marks types as passable to accelerator devices.
--- ## DeviceAttribute `@register_passable(trivial)` `struct DeviceAttribute` Represents CUDA device attributes that can be queried from a GPU device. This struct encapsulates the various device properties and capabilities that can be queried through the CUDA driver API. Each attribute is represented as a constant with a corresponding integer value that maps to the CUDA driver's attribute enum. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `CLOCK_RATE` `alias CLOCK_RATE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](13))` Typical clock frequency in kilohertz ### `COMPUTE_CAPABILITY_MAJOR` `alias COMPUTE_CAPABILITY_MAJOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](75))` Major compute capability version number ### `COMPUTE_CAPABILITY_MINOR` `alias COMPUTE_CAPABILITY_MINOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](76))` Minor compute capability version number ### `MAX_ACCESS_POLICY_WINDOW_SIZE` `alias MAX_ACCESS_POLICY_WINDOW_SIZE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](109))` CUDA-only: Maximum value of CUaccessPolicyWindow::num\_bytes. ### `MAX_BLOCK_DIM_X` `alias MAX_BLOCK_DIM_X = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](2))` Maximum block dimension X ### `MAX_BLOCK_DIM_Y` `alias MAX_BLOCK_DIM_Y = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](3))` Maximum block dimension Y ### `MAX_BLOCK_DIM_Z` `alias MAX_BLOCK_DIM_Z = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](4))` Maximum block dimension Z ### `MAX_BLOCKS_PER_MULTIPROCESSOR` `alias MAX_BLOCKS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](106))` Maximum resident blocks per multiprocessor ### `MAX_GRID_DIM_X` `alias MAX_GRID_DIM_X = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](5))` Maximum grid dimension X ### `MAX_GRID_DIM_Y` `alias MAX_GRID_DIM_Y = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](6))` Maximum grid dimension Y ### `MAX_GRID_DIM_Z` `alias MAX_GRID_DIM_Z = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](7))` Maximum grid dimension Z ### `MAX_REGISTERS_PER_BLOCK` `alias MAX_REGISTERS_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](12))` Maximum number of 32-bit registers available per block ### `MAX_REGISTERS_PER_MULTIPROCESSOR` `alias MAX_REGISTERS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](82))` Maximum number of 32-bit registers available per multiprocessor ### `MAX_SHARED_MEMORY_PER_BLOCK` `alias MAX_SHARED_MEMORY_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](8))` Maximum shared memory available per block in bytes ### `MAX_SHARED_MEMORY_PER_MULTIPROCESSOR` `alias MAX_SHARED_MEMORY_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](81))` Maximum shared memory available per multiprocessor in bytes ### `MAX_THREADS_PER_BLOCK` `alias MAX_THREADS_PER_BLOCK = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](1))` Maximum number of threads per block ### `MAX_THREADS_PER_MULTIPROCESSOR` `alias MAX_THREADS_PER_MULTIPROCESSOR = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](39))` Maximum resident threads per multiprocessor ### `MULTIPROCESSOR_COUNT` `alias MULTIPROCESSOR_COUNT = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](16))` Number of multiprocessors on device ### `WARP_SIZE` `alias WARP_SIZE = DeviceAttribute(__init__[__mlir_type.!pop.int_literal](10))` Warp size in threads --- ## DeviceBuffer `struct 
DeviceBuffer[type: DType]` Represents a block of device-resident storage. For GPU devices, a device buffer is allocated in the device's global memory. To allocate a `DeviceBuffer`, use one of the methods provided by `DeviceContext`, such as [`enqueue_create_buffer()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_create_buffer). ## Parameters * ​type (`DType`): Data type to be stored in the buffer. ## Implemented traits `AnyType`, `Copyable`, `DevicePassable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `device_type` `alias device_type = UnsafePointer[SIMD[type, 1]]` DeviceBuffer types are remapped to UnsafePointer when passed to accelerator devices. ## Methods ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a copy of an existing device buffer by incrementing its reference count. This copy constructor creates a new reference to the same underlying device buffer by incrementing the reference count of the native buffer object. Both the original and the copy will refer to the same memory on the device. **Args:** * ​existing (`Self`): The device buffer to copy. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Initializes this buffer by taking ownership of an existing buffer. This move constructor transfers ownership of the device buffer from the existing instance to the new instance without incrementing the reference count. **Args:** * ​existing (`Self`): The buffer to move from, which will no longer be valid after this call. ### `__del__` `__del__(owned self)` Releases resources associated with this device buffer. This function schedules an owned buffer free using the stream in the device context. The actual deallocation may occur asynchronously after all operations using this buffer have completed. ### `get_type_name` `static get_type_name() -> String` Gets this type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `__len__` `__len__(self) -> Int` Returns the number of elements in this buffer. This method calculates the number of elements by dividing the total byte size of the buffer by the size of each element. **Returns:** The number of elements in the buffer. ### `create_sub_buffer` `create_sub_buffer[view_type: DType](self, offset: Int, size: Int) -> DeviceBuffer[view_type]` Creates a sub-buffer view of this buffer with a different element type. This method creates a new buffer that references a subset of the memory in this buffer, potentially with a different element type. The sub-buffer shares the underlying memory with the original buffer. **Parameters:** * ​view\_type (`DType`): The data type for elements in the new sub-buffer. **Args:** * ​offset (`Int`): The starting offset in elements from the beginning of this buffer. * ​size (`Int`): The number of elements in the new sub-buffer. **Returns:** A new DeviceBuffer referencing the specified region with the specified element type. 
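As a brief illustration (a sketch with illustrative names, not part of the generated reference), allocating a buffer and carving a typed view out of its second half might look like this:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
# Allocate 1024 float32 elements in the device's global memory.
var buf = ctx.enqueue_create_buffer[DType.float32](1024)
# View the second half of the allocation as a separate 512-element buffer.
# The sub-buffer shares memory with the original allocation.
var sub = buf.create_sub_buffer[DType.float32](512, 512)
```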
### `enqueue_copy_to` `enqueue_copy_to(self, dst: Self)` Enqueues an asynchronous copy from this buffer to another device buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst (`Self`): The destination device buffer to copy data to. `enqueue_copy_to(self, dst: HostBuffer[type])` Enqueues an asynchronous copy from this buffer to a host buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst (`HostBuffer[type]`): The destination host buffer to copy data to. `enqueue_copy_to(self, dst_ptr: UnsafePointer[SIMD[type, 1]])` Enqueues an asynchronous copy from this buffer to host memory. This method schedules a memory copy operation from this device buffer to the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the destination host memory location. ### `enqueue_copy_from` `enqueue_copy_from(self, src: Self)` Enqueues an asynchronous copy to this buffer from another device buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src (`Self`): The source device buffer to copy data from. `enqueue_copy_from(self, src: HostBuffer[type])` Enqueues an asynchronous copy to this buffer from a host buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src (`HostBuffer[type]`): The source host buffer to copy data from. `enqueue_copy_from(self, src_ptr: UnsafePointer[SIMD[type, 1]])` Enqueues an asynchronous copy to this buffer from host memory. This method schedules a memory copy operation to this device buffer from the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the source host memory location. ### `enqueue_fill` `enqueue_fill(self, val: SIMD[type, 1]) -> Self` Enqueues an operation to fill this buffer with a specified value. This method schedules a memory set operation that fills the entire buffer with the specified value. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​val (`SIMD[type, 1]`): The value to fill the buffer with. **Returns:** Self reference for method chaining. ### `reassign_ownership_to` `reassign_ownership_to(self, ctx: DeviceContext)` Transfers ownership of this buffer to another device context. This method changes the device context that owns this buffer. This can be useful when sharing buffers between different contexts or when migrating workloads between devices. **Args:** * ​ctx (`DeviceContext`): The new device context to take ownership of this buffer. ### `take_ptr` `take_ptr(owned self) -> UnsafePointer[SIMD[type, 1]]` Takes ownership of the device pointer from this buffer. This method releases the device pointer from the buffer's control and returns it to the caller. 
After this call, the buffer no longer owns the pointer, and the caller is responsible for managing its lifecycle. **Returns:** The raw device pointer that was owned by this buffer. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[type, 1]]` Returns the raw device pointer without transferring ownership. This method provides direct access to the underlying device pointer for advanced use cases. The buffer retains ownership of the pointer. **Returns:** The raw device pointer owned by this buffer. ### `context` `context(self) -> DeviceContext` Returns the device context associated with this buffer. This method retrieves the device context that owns this buffer and is responsible for managing its lifecycle and operations. **Returns:** The device context associated with this buffer. ### `map_to_host` `map_to_host(self, out mapped_buffer: _HostMappedBuffer[type])` Maps this device buffer to host memory for CPU access. This method creates a host-accessible view of the device buffer's contents. The mapping operation may involve copying data from device to host memory. Notes: Values modified inside the `with` statement are updated on the device when the `with` statement exits. Example: ```mojo from gpu.host import DeviceContext var ctx = DeviceContext() var length = 1024 var in_dev = ctx.enqueue_create_buffer[DType.float32](length) var out_dev = ctx.enqueue_create_buffer[DType.float32](length) # Initialize the input and output with known values. with in_dev.map_to_host() as in_host, out_dev.map_to_host() as out_host: for i in range(length): in_host[i] = i out_host[i] = 255 ``` **Returns:** A host-mapped buffer that provides CPU access to the device buffer's contents inside a with-statement. **Raises:** If there's an error during buffer creation or data transfer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of this buffer to the provided writer. This method formats the buffer's contents as a string and writes it to the specified writer. For large buffers, a compact representation is used. **Parameters:** * ​W (`Writer`): The writer type. **Args:** * ​writer (`W`): The writer to output the formatted string to. ### `__str__` `__str__(self) -> String` Returns a string representation of the `DeviceBuffer`. This method creates a human-readable string representation of the buffer's contents by mapping the device memory to host memory and formatting the elements. **Returns:** A string containing the formatted buffer contents. --- ## DeviceContext `@register_passable` `struct DeviceContext` Represents a single stream of execution on a particular accelerator (GPU). A `DeviceContext` serves as the low-level interface to the accelerator inside a MAX [custom operation](/max/custom-ops/) and provides methods for allocating buffers on the device, copying data between host and device, and for compiling and running functions (also known as kernels) on the device. The device context can be used as a [context manager](/mojo/manual/errors#use-a-context-manager). 
For example:

```mojo
from gpu.host import DeviceContext
from gpu import thread_idx

fn kernel():
    print("hello from thread:", thread_idx.x, thread_idx.y, thread_idx.z)

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
    ctx.synchronize()
```

A custom operation receives an opaque `DeviceContextPtr`, which provides a `get_device_context()` method to retrieve the device context:

```mojo
import compiler
from runtime.asyncrt import DeviceContextPtr

@compiler.register("custom_op")
struct CustomOp:
    @staticmethod
    fn execute(ctx_ptr: DeviceContextPtr) raises:
        var ctx = ctx_ptr.get_device_context()
        # `kernel` is the function defined in the previous example.
        ctx.enqueue_function[kernel](grid_dim=1, block_dim=(2, 2, 2))
        ctx.synchronize()
```

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `device_api`

`alias device_api = from_name[::StringSlice[::Bool().api`

Device API for the default accelerator (for example, "cuda" or "hip").

### `device_info`

`alias device_info = from_name[::StringSlice[::Bool()`

`gpu.info.Info` object for the default accelerator.

## Methods

### `__init__`

`__init__(out self, device_id: Int = 0, *, owned api: String = String(from_name[::StringSlice[::Bool()))`

Constructs a `DeviceContext` for the specified device. This initializer creates a new device context for the specified accelerator device. The device context provides an interface for interacting with the GPU, including memory allocation, data transfer, and kernel execution.

Example:

```mojo
from gpu.host import DeviceContext

# Create a context for the default GPU
var ctx = DeviceContext()

# Create a context for a specific GPU (device 1)
var ctx2 = DeviceContext(1)
```

**Args:**

* ​device\_id (`Int`): ID of the accelerator device. If not specified, uses the default accelerator (device 0).
* ​api (`String`): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the DeviceContext class.

**Raises:**

If device initialization fails or the specified device is not available.

### `__copyinit__`

`__copyinit__(existing: Self) -> Self`

Creates a copy of an existing device context by incrementing its reference count. This copy constructor creates a new reference to the same underlying device context by incrementing the reference count of the native context object. Both the original and the copy will refer to the same device context.

**Args:**

* ​existing (`Self`): The device context to copy.

### `__del__`

`__del__(owned self)`

Releases resources associated with this device context. This destructor decrements the reference count of the native device context. When the reference count reaches zero, the underlying resources are released, including any cached memory buffers and compiled device functions.

### `copy`

`copy(self) -> Self`

Explicitly constructs a copy of this device context. This method creates a new reference to the same underlying device context by incrementing the reference count of the native context object.

**Returns:**

A copy of this device context that refers to the same underlying context.

### `__enter__`

`__enter__(owned self) -> Self`

Enables the use of DeviceContext in a 'with' statement context manager. This method allows DeviceContext to be used with Python-style context managers, which ensures proper resource management and cleanup when the context exits.
Example:

```mojo
from gpu.host import DeviceContext

# Using DeviceContext as a context manager
with DeviceContext() as ctx:
    # Perform GPU operations here; resources are automatically
    # released when exiting the block.
    pass
```

**Returns:**

The DeviceContext instance to be used within the context manager block.

### `name`

`name(self) -> String`

Returns the device name, an ASCII string identifying this device, defined by the native device API. This method queries the underlying GPU device for its name, which typically includes the model and other identifying information. This can be useful for logging, debugging, or making runtime decisions based on the specific GPU hardware.

Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
print("Running on device:", ctx.name())
```

**Returns:**

A string containing the device name.

### `api`

`api(self) -> String`

Returns the name of the API used to program the device. This method queries the underlying device context to determine which GPU programming API is being used for the current device. This information is useful for writing code that can adapt to different GPU architectures and programming models.

Possible values are:

* "cpu": Generic host device (CPU).
* "cuda": NVIDIA GPUs.
* "hip": AMD GPUs.

Example:

```mojo
from gpu.host import DeviceContext

var ctx = DeviceContext()
var api_name = ctx.api()
print("Using device API:", api_name)

# Conditionally execute code based on the API
if api_name == "cuda":
    print("Running on NVIDIA GPU")
elif api_name == "hip":
    print("Running on AMD GPU")
```

**Returns:**

A string identifying the device API.

### `enqueue_create_buffer`

`enqueue_create_buffer[type: DType](self, size: Int) -> DeviceBuffer[type]`

Enqueues a buffer creation using the `DeviceBuffer` constructor. For GPU devices, the space is allocated in the device's global memory.

**Parameters:**

* ​type (`DType`): The data type to be stored in the allocated memory.

**Args:**

* ​size (`Int`): The number of elements of `type` to allocate memory for.

**Returns:**

The allocated buffer.

### `create_buffer_sync`

`create_buffer_sync[type: DType](self, size: Int) -> DeviceBuffer[type]`

Creates a buffer synchronously using the `DeviceBuffer` constructor.

**Parameters:**

* ​type (`DType`): The data type to be stored in the allocated memory.

**Args:**

* ​size (`Int`): The number of elements of `type` to allocate memory for.

**Returns:**

The allocated buffer.

### `enqueue_create_host_buffer`

`enqueue_create_host_buffer[type: DType](self, size: Int) -> HostBuffer[type]`

Enqueues the creation of a HostBuffer. This function allocates memory on the host that is accessible by the device. The memory is page-locked (pinned) for efficient data transfer between host and device.

Pinned memory is guaranteed to remain resident in the host's RAM, not be paged/swapped out to disk. Memory allocated normally (for example, using [`UnsafePointer.alloc()`](/mojo/stdlib/memory/unsafe_ptr/UnsafePointer#alloc)) is pageable—individual pages of memory can be moved to secondary storage (disk/SSD) when main memory fills up.

Using pinned memory allows devices to make fast transfers between host memory and device memory, because they can use direct memory access (DMA) to transfer data without relying on the CPU.

Allocating too much pinned memory can cause performance issues, since it reduces the amount of memory available for other processes.
Example:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Allocate host memory accessible by the device
    var host_buffer = ctx.enqueue_create_host_buffer[DType.float32](1024)

    # Use the host buffer for device operations
    # ...
```

**Parameters:**

* ​type (`DType`): The data type to be stored in the allocated memory.

**Args:**

* ​size (`Int`): The number of elements of `type` to allocate memory for.

**Returns:**

A `HostBuffer` object that wraps the allocated host memory.

**Raises:**

If memory allocation fails or if the device context is invalid.

### `compile_function`

`compile_function[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(None), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])`

Compiles the provided function for execution on this device.

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): Type of the function.
* ​func (`func_type`): The function to compile.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).
* ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`.

**Args:**

* ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size).

**Returns:**

The compiled function.
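For example, the following sketch compiles a trivial kernel once and then launches it repeatedly, mirroring the `enqueue_function()` examples later in this section (the `kernel` name is illustrative):

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    # Compile the kernel once...
    var compiled = ctx.compile_function[kernel]()
    # ...then enqueue the compiled function as many times as needed.
    ctx.enqueue_function(compiled, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled, grid_dim=1, block_dim=1)
    ctx.synchronize()
```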
### `compile_function_unchecked`

`compile_function_unchecked[func_type: AnyTrivialRegType, //, func: func_type, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(None), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])`

Compiles the provided function for execution on this device.

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): Type of the function.
* ​func (`func_type`): The function to compile.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).
* ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`.

**Args:**

* ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size).

**Returns:**

The compiled function.

### `compile_function_checked`

`compile_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])`

Compiles the provided function for execution on this device.

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): Type of the function.
* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`func_type`): The function to compile.
* ​signature\_func (`fn(*args: *declared_arg_types) -> None`): The function to compile, passed in again. Used for checking argument types later. Note: This will disappear in future versions.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).
* ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`.

**Args:**

* ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size).

**Returns:**

The compiled function.

`compile_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) capturing -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])`

Compiles the provided function for execution on this device.

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): Type of the function.
* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`func_type`): The function to compile.
* ​signature\_func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile, passed in again. Used for checking argument types later. Note: This will disappear in future versions.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).
* ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`.
**Args:**

* ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size).

**Returns:**

The compiled function.

### `compile_function_experimental`

`compile_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])`

Compiles the provided function for execution on this device.

**Parameters:**

* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`fn(*args: *declared_arg_types) -> None`): The function to compile.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).
* ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`.

**Args:**

* ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size).

**Returns:**

The compiled function.

`compile_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) capturing -> None, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False, _target: target = from_name[::StringSlice[::Bool().target()](self, *, func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceFunction[func, Optional(declared_arg_types), target=_target, _ptxas_info_verbose=_ptxas_info_verbose])`

Compiles the provided function for execution on this device.

**Parameters:**

* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).
* ​\_target (`target`): Change the target to a different device type than the one associated with this `DeviceContext`.

**Args:**

* ​func\_attribute (`OptionalReg[FuncAttribute]`): An attribute to use when compiling the code (such as maximum shared memory size).

**Returns:**

The compiled function.

### `load_function`

`load_function[func_type: AnyTrivialRegType, //, func: func_type](self, *, function_name: StringSlice[origin], asm: StringSlice[origin], func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}), out result: DeviceExternalFunction)`

Loads a pre-compiled device function from assembly code. This method loads an external GPU function from provided assembly code (PTX/SASS) rather than compiling it from Mojo source. This is useful for integrating with existing CUDA/HIP code or for using specialized assembly optimizations.

Example:

```mojo
from gpu.host import DeviceContext
from gpu.host.device_context import DeviceExternalFunction

fn func_signature(
    # Arguments being passed to the assembly code,
    # e.g. two pointers and a length.
    input: UnsafePointer[Float32],
    output: UnsafePointer[Float32],
    len: Int,
):
    # No body because that is passed as assembly code below.
    pass

var ctx = DeviceContext()
var ptx_code = "..."  # PTX assembly code
var ext_func = ctx.load_function[func_signature](
    function_name="my_kernel",
    asm=ptx_code,
)
```

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): The type of the function to load.
* ​func (`func_type`): The function reference.

**Args:**

* ​function\_name (`StringSlice[origin]`): The name of the function in the assembly code.
* ​asm (`StringSlice[origin]`): The assembly code (PTX/SASS) containing the function.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): Optional attribute to apply to the function (such as maximum shared memory size).

**Returns:**

The loaded function is stored in the `result` parameter.

**Raises:**

If loading the function fails or the assembly code is invalid.
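Once loaded, the function can be launched with `enqueue_function()`. The following sketch continues the example above; the buffer sizes and launch configuration are illustrative assumptions:

```mojo
# Continues the example above, where `ext_func` was returned by
# load_function() and expects two Float32 pointers and a length.
var length = 1024
var in_dev = ctx.enqueue_create_buffer[DType.float32](length)
var out_dev = ctx.enqueue_create_buffer[DType.float32](length)

# Launch the loaded kernel with the arguments declared in func_signature.
ctx.enqueue_function(
    ext_func,
    in_dev.unsafe_ptr(),
    out_dev.unsafe_ptr(),
    length,
    grid_dim=4,
    block_dim=256,
)
ctx.synchronize()
```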
### `enqueue_function`

`enqueue_function[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List(), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): The type of the function to launch.
* ​func (`func_type`): The function to launch.
* ​\*Ts (`AnyType`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*Ts`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
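As a sketch of passing arguments to a kernel launched this way (the kernel, buffer, and launch configuration below are illustrative, not part of the API):

```mojo
from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

fn fill_with_index(data: UnsafePointer[Float32]):
    # Each GPU thread writes its own index into the buffer.
    data[thread_idx.x] = Float32(thread_idx.x)

with DeviceContext() as ctx:
    var buf = ctx.enqueue_create_buffer[DType.float32](16)
    ctx.enqueue_function[fill_with_index](
        buf.unsafe_ptr(), grid_dim=1, block_dim=16
    )
    # Map the buffer to the host to read back the results.
    with buf.map_to_host() as host:
        print(host[0], host[15])
```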
`enqueue_function[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List())`

Enqueues a compiled function for execution on this device.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​\*Ts (`AnyType`): Argument types.

**Args:**

* ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute.
* ​\*args (`*Ts`): Arguments to pass to the function.
* ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks.
* ​block\_dim (`Dim`): Dimensions of each thread block in the grid.
* ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters).
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): Launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping.

`enqueue_function[*Ts: AnyType](self, f: DeviceExternalFunction, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List())`

Enqueues an external device function for asynchronous execution on the GPU. This method schedules an external device function to be executed on the GPU with the specified execution configuration. The function and its arguments are passed to the underlying GPU runtime, which will execute them when resources are available.

Example:

```mojo
from gpu.host import DeviceContext

# Create a device context and load an external function
# (see load_function() above for how `func_signature` and
# `ptx_code` are defined).
with DeviceContext() as ctx:
    var ext_func = ctx.load_function[func_signature](
        function_name="my_kernel",
        asm=ptx_code,
    )

    # Enqueue the external function with execution configuration
    ctx.enqueue_function(
        ext_func,
        grid_dim=16,
        block_dim=256,
    )

    # Wait for completion
    ctx.synchronize()
```

**Parameters:**

* ​\*Ts (`AnyType`): The types of the arguments to be passed to the device function.

**Args:**

* ​f (`DeviceExternalFunction`): The external device function to execute.
* ​\*args (`*Ts`): The arguments to pass to the device function.
* ​grid\_dim (`Dim`): The dimensions of the grid (number of thread blocks).
* ​block\_dim (`Dim`): The dimensions of each thread block (number of threads per block).
* ​cluster\_dim (`OptionalReg[Dim]`): Optional dimensions for thread block clusters (for newer GPU architectures).
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Optional amount of dynamic shared memory to allocate per block.
* ​attributes (`List[LaunchAttribute]`): Optional list of launch attributes for fine-grained control.
* ​constant\_memory (`List[ConstantMemoryMapping]`): Optional list of constant memory mappings to use during execution.

**Raises:**

If there's an error enqueuing the function or if the function execution fails.

### `enqueue_function_unchecked`

`enqueue_function_unchecked[func_type: AnyTrivialRegType, //, func: func_type, *Ts: AnyType, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List(), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): The type of the function to launch.
* ​func (`func_type`): The function to launch.
* ​\*Ts (`AnyType`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*Ts`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.

`enqueue_function_unchecked[*Ts: AnyType](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List())`

Enqueues a compiled function for execution on this device.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​\*Ts (`AnyType`): Argument types.

**Args:**

* ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute.
* ​\*args (`*Ts`): Arguments to pass to the function.
* ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks.
* ​block\_dim (`Dim`): Dimensions of each thread block in the grid.
* ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters).
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): Launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping.

### `enqueue_function_checked`

`enqueue_function_checked[*Ts: DevicePassable](self, f: DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose], *args: *Ts, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List())`

Enqueues a compiled function for execution on this device.
You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile the function first to remove the overhead:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​\*Ts (`DevicePassable`): Argument types.

**Args:**

* ​f (`DeviceFunction[func, declared_arg_types, target=target, _ptxas_info_verbose=_ptxas_info_verbose]`): The compiled function to execute.
* ​\*args (`*Ts`): Arguments to pass to the function.
* ​grid\_dim (`Dim`): Dimensions of the compute grid, made up of thread blocks.
* ​block\_dim (`Dim`): Dimensions of each thread block in the grid.
* ​cluster\_dim (`OptionalReg[Dim]`): Dimensions of clusters (if the thread blocks are grouped into clusters).
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): Launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): Constant memory mapping.

`enqueue_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List(), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): The type of the function to launch.
* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`func_type`): The function to compile and launch.
* ​signature\_func (`fn(*args: *declared_arg_types) -> None`): The function to compile and launch, passed in again. Used for checking argument types later. Note: This will disappear in future versions.
* ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.

`enqueue_function_checked[func_type: AnyTrivialRegType, declared_arg_types: Variadic[AnyType], //, func: func_type, signature_func: fn(*args: *declared_arg_types) capturing -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List(), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device. This overload takes in a function that's `capturing`.
You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​func\_type (`AnyTrivialRegType`): The type of the function to launch.
* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`func_type`): The function to compile and launch.
* ​signature\_func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile and launch, passed in again. Used for checking argument types later. Note: This will disappear in future versions.
* ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
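The following is a minimal sketch of the checked launch path, assuming the current convention of passing the kernel twice (once as `func`, once as `signature_func`); the kernel and buffer are illustrative:

```mojo
from gpu.host import DeviceContext
from memory import UnsafePointer

fn scale(data: UnsafePointer[Float32], factor: Float32):
    data[0] = data[0] * factor

with DeviceContext() as ctx:
    var buf = ctx.enqueue_create_buffer[DType.float32](1).enqueue_fill(1.0)
    # The kernel is passed twice so that the declared argument types
    # can be checked against the actual arguments at compile time.
    ctx.enqueue_function_checked[scale, scale](
        buf.unsafe_ptr(), Float32(2.0), grid_dim=1, block_dim=1
    )
    ctx.synchronize()
```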
### `enqueue_function_experimental`

`enqueue_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List(), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`fn(*args: *declared_arg_types) -> None`): The function to compile and launch.
* ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
`enqueue_function_experimental[declared_arg_types: Variadic[AnyType], //, func: fn(*args: *declared_arg_types) capturing -> None, *actual_arg_types: DevicePassable, *, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _ptxas_info_verbose: Bool = False](self, *args: *actual_arg_types, *, grid_dim: Dim, block_dim: Dim, cluster_dim: OptionalReg[Dim] = OptionalReg[Dim]({:i1 0, 1}), shared_mem_bytes: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), owned attributes: List[LaunchAttribute] = List(), owned constant_memory: List[ConstantMemoryMapping] = List(), func_attribute: OptionalReg[FuncAttribute] = OptionalReg[FuncAttribute]({:i1 0, 1}))`

Compiles and enqueues a kernel for execution on this device. This overload takes in a function that's `capturing`.

You can pass the function directly to `enqueue_function` without compiling it first:

```mojo
from gpu.host import DeviceContext

fn kernel():
    print("hello from the GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[kernel](grid_dim=1, block_dim=1)
    ctx.synchronize()
```

If you are reusing the same function and parameters multiple times, this incurs 50-500 nanoseconds of overhead per enqueue, so you can compile it first to remove the overhead:

```mojo
with DeviceContext() as ctx:
    var compile_func = ctx.compile_function[kernel]()
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compile_func, grid_dim=1, block_dim=1)
    ctx.synchronize()
```

**Parameters:**

* ​declared\_arg\_types (`Variadic[AnyType]`): Types of the arguments to pass to the device function.
* ​func (`fn(*args: *declared_arg_types) capturing -> None`): The function to compile and launch.
* ​\*actual\_arg\_types (`DevicePassable`): The types of the arguments being passed to the function.
* ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the compiled assembly, pass `True`, or a file path to dump to, or a function returning a file path.
* ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): To dump the generated LLVM code, pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass `True`, or a file path to dump to, or a function returning a file path.
* ​\_ptxas\_info\_verbose (`Bool`): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes `dump_asm` to output verbose PTX assembly (default `False`).

**Args:**

* ​\*args (`*actual_arg_types`): Variadic arguments which are passed to the `func`.
* ​grid\_dim (`Dim`): The grid dimensions.
* ​block\_dim (`Dim`): The block dimensions.
* ​cluster\_dim (`OptionalReg[Dim]`): The cluster dimensions.
* ​shared\_mem\_bytes (`OptionalReg[Int]`): Amount of shared memory per thread block.
* ​attributes (`List[LaunchAttribute]`): A `List` of launch attributes.
* ​constant\_memory (`List[ConstantMemoryMapping]`): A `List` of constant memory mappings.
* ​func\_attribute (`OptionalReg[FuncAttribute]`): `CUfunction_attribute` enum.
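A corresponding sketch for the experimental variant, which takes the kernel only once because the declared argument types are read from its signature (again, the kernel and buffer are illustrative, and this API is subject to change):

```mojo
from gpu.host import DeviceContext
from memory import UnsafePointer

fn add_one(data: UnsafePointer[Float32]):
    data[0] = data[0] + 1.0

with DeviceContext() as ctx:
    var buf = ctx.enqueue_create_buffer[DType.float32](1).enqueue_fill(0.0)
    # The declared argument types are inferred from `add_one` itself.
    ctx.enqueue_function_experimental[add_one](
        buf.unsafe_ptr(), grid_dim=1, block_dim=1
    )
    ctx.synchronize()
```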
### `execution_time`

`execution_time[: origin.set, //, func: fn(DeviceContext) raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function that takes a DeviceContext parameter. This method times the execution of a provided function that requires the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext

fn gpu_operation(ctx: DeviceContext) raises:
    # Perform some GPU operation using ctx
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function that uses the context
    var time_ns = ctx.execution_time[gpu_operation](10)
    print("Execution time for 10 iterations:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn(DeviceContext) raises capturing -> None`): A function that takes a DeviceContext parameter to execute and time.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:**

The total elapsed time in nanoseconds for all iterations.

**Raises:**

If the timer operations fail or if the function raises an exception.

`execution_time[: origin.set, //, func: fn() raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function over multiple iterations. This method times the execution of a provided function that doesn't require the DeviceContext as a parameter. It runs the function for the specified number of iterations and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext

fn some_gpu_operation() raises:
    # Perform some GPU operation
    pass

with DeviceContext() as ctx:
    # Measure execution time of a function
    var time_ns = ctx.execution_time[some_gpu_operation](10)
    print("Execution time:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn() raises capturing -> None`): A function with no parameters to execute and time.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:**

The total elapsed time in nanoseconds for all iterations.

**Raises:**

If the timer operations fail or if the function raises an exception.

### `execution_time_iter`

`execution_time_iter[: origin.set, //, func: fn(DeviceContext, Int) raises capturing -> None](self, num_iters: Int) -> Int`

Measures the execution time of a function that takes an iteration index as input. This method times the execution of a provided function that requires both the DeviceContext and the current iteration index as parameters. It runs the function for the specified number of iterations, passing the iteration index to each call, and returns the total elapsed time in nanoseconds.

Example:

```mojo
from gpu.host import DeviceContext

fn kernel():
    # Kernel to benchmark.
    pass

fn benchmark_kernel(ctx: DeviceContext, i: Int) raises:
    # Run the kernel with a grid size that depends on the iteration index.
    ctx.enqueue_function[kernel](grid_dim=i + 1, block_dim=256)

with DeviceContext() as ctx:
    # Measure execution time with iteration awareness
    var time_ns = ctx.execution_time_iter[benchmark_kernel](10)
    print("Total execution time:", time_ns, "ns")
```

**Parameters:**

* ​func (`fn(DeviceContext, Int) raises capturing -> None`): A function that takes the DeviceContext and an iteration index.

**Args:**

* ​num\_iters (`Int`): The number of iterations to run the function.

**Returns:**

The total elapsed time in nanoseconds for all iterations.
**Raises:**

If the timer operations fail or if the function raises an exception.

### `enqueue_copy`

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an async copy from the host to the provided device buffer. The number of bytes copied is determined by the size of the device buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy from.

`enqueue_copy[type: DType](self, dst_buf: HostBuffer[type], src_ptr: UnsafePointer[SIMD[type, 1]])`

Enqueues an async copy from host memory to the provided host buffer. The number of bytes copied is determined by the size of the buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`HostBuffer[type]`): Host buffer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: DeviceBuffer[type])`

Enqueues an async copy from the device to the host. The number of bytes copied is determined by the size of the device buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy to.
* ​src\_buf (`DeviceBuffer[type]`): Device buffer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_buf: HostBuffer[type])`

Enqueues an async copy from the provided host buffer to host memory. The number of bytes copied is determined by the size of the buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Host pointer to copy to.
* ​src\_buf (`HostBuffer[type]`): Host buffer to copy from.

`enqueue_copy[type: DType](self, dst_ptr: UnsafePointer[SIMD[type, 1]], src_ptr: UnsafePointer[SIMD[type, 1]], size: Int)`

Enqueues an async copy of `size` elements from a device pointer to another device pointer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Device pointer to copy to.
* ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Device pointer to copy from.
* ​size (`Int`): Number of elements (of the specified `DType`) to copy.

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_buf: DeviceBuffer[type])`

Enqueues an async copy from one device buffer to another. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_buf (`DeviceBuffer[type]`): Device buffer to copy from. Must be at least as large as `dst_buf`.

`enqueue_copy[type: DType](self, dst_buf: DeviceBuffer[type], src_buf: HostBuffer[type])`

Enqueues an async copy from a host buffer to a device buffer. The amount of data transferred is determined by the size of the destination buffer.

**Parameters:**

* ​type (`DType`): Type of the data being copied.

**Args:**

* ​dst\_buf (`DeviceBuffer[type]`): Device buffer to copy to.
* ​src\_buf (`HostBuffer[type]`): Host buffer to copy from. Must be at least as large as `dst_buf`.

`enqueue_copy[type: DType](self, dst_buf: HostBuffer[type], src_buf: DeviceBuffer[type])`

Enqueues an async copy from a device buffer to a host buffer.
### `enqueue_memset`

`enqueue_memset[type: DType](self, dst: DeviceBuffer[type], val: SIMD[type, 1])`

Enqueues an async memset operation, setting all of the elements in the destination device buffer to the specified value.

**Parameters:**

* type (`DType`): Type of the data stored in the buffer.

**Args:**

* dst (`DeviceBuffer[type]`): Destination buffer.
* val (`SIMD[type, 1]`): Value to set all elements of `dst` to.

`enqueue_memset[type: DType](self, dst: HostBuffer[type], val: SIMD[type, 1])`

Enqueues an async memset operation, setting all of the elements in the destination host buffer to the specified value.

**Parameters:**

* type (`DType`): Type of the data stored in the buffer.

**Args:**

* dst (`HostBuffer[type]`): Destination buffer.
* val (`SIMD[type, 1]`): Value to set all elements of `dst` to.

### `synchronize`

`synchronize(self)`

Blocks until all asynchronous calls on the stream associated with this device context have completed. This should never be necessary when writing a custom operation.

### `enqueue_wait_for`

`enqueue_wait_for(self, other: Self)`

Enqueues a wait operation for another device context to complete its work. This method creates a dependency between two device contexts, ensuring that operations in the current context will not begin execution until all previously enqueued operations in the other context have completed. This is useful for synchronizing work across multiple devices or streams.

Example:

```mojo
from gpu.host import DeviceContext

# Create two device contexts
var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU

# Enqueue operations on ctx1
# ...

# Make ctx2 wait for ctx1 to complete before proceeding
ctx2.enqueue_wait_for(ctx1)

# Enqueue operations on ctx2 that depend on ctx1's completion
# ...
```

**Args:**

* other (`Self`): The device context whose operations must complete before operations in this context can proceed.

**Raises:** If there's an error enqueuing the wait operation or if the operation is not supported by the underlying device API.

### `get_api_version`

`get_api_version(self) -> Int`

Returns the API version associated with this device. This method retrieves the version number of the GPU driver currently installed on the system for the device associated with this context. The version is returned as an integer that can be used to check compatibility with specific features or to troubleshoot driver-related issues.

Example:

```mojo
from gpu.host import DeviceContext

with DeviceContext() as ctx:
    # Get the API version
    var api_version = ctx.get_api_version()
    print("GPU API version:", api_version)
```

**Returns:** An integer representing the driver version.
**Raises:** If the driver version cannot be retrieved or if the device context is invalid. ### `get_attribute` `get_attribute(self, attr: DeviceAttribute) -> Int` Returns the specified attribute for this device. Use the aliases defined by [DeviceAttribute](/mojo/stdlib/gpu/host/device_attribute/DeviceAttribute) to specify attributes. For example: ```mojo from gpu.host import DeviceAttribute, DeviceContext def main(): var ctx = DeviceContext() var attr = DeviceAttribute.MAX_BLOCKS_PER_MULTIPROCESSOR var max_blocks = ctx.get_attribute(attr) print(max_blocks) ``` **Args:** * ​attr (`DeviceAttribute`): The device attribute to query. **Returns:** The value for `attr` on this device. ### `is_compatible` `is_compatible(self) -> Bool` Returns True if this device is compatible with MAX. This method checks whether the current device is compatible with the Modular Accelerated Execution (MAX) runtime. It's useful for validating that the device can execute the compiled code before attempting operations. Example: ```mojo from gpu.host import DeviceContext var ctx = DeviceContext() print("Device is compatible with MAX:", ctx.is_compatible()) ``` **Returns:** True if the device is compatible with MAX, False otherwise. ### `id` `id(self) -> SIMD[int64, 1]` Returns the ID associated with this device. This method retrieves the unique identifier for the current device. Device IDs are used to distinguish between multiple devices in a system and are often needed for multi-GPU programming. Example: ```mojo var ctx = DeviceContext() try: var device_id = ctx.id() print("Using device with ID:", device_id) except: print("Failed to get device ID") ``` **Returns:** The unique device ID as an Int64. **Raises:** If there's an error retrieving the device ID. ### `get_memory_info` `get_memory_info(self) -> Tuple[UInt, UInt]` Returns the free and total memory size for this device. This method queries the current state of device memory, providing information about how much memory is available and the total memory capacity of the device. This is useful for memory management and determining if there's enough space for planned operations. Example: ```mojo from gpu.host import DeviceContext var ctx = DeviceContext() try: (free, total) = ctx.get_memory_info() print("Free memory:", free / (1024*1024), "MB") print("Total memory:", total / (1024*1024), "MB") except: print("Failed to get memory information") ``` **Returns:** A tuple of (free memory, total memory) in bytes. **Raises:** If there's an error retrieving the memory information. ### `can_access` `can_access(self, peer: Self) -> Bool` Returns True if this device can access the identified peer device. This method checks whether the current device can directly access memory on the specified peer device. Peer-to-peer access allows for direct memory transfers between devices without going through host memory, which can significantly improve performance in multi-GPU scenarios. Example: ```mojo from gpu.host import DeviceContext var ctx1 = DeviceContext(0) # First GPU var ctx2 = DeviceContext(1) # Second GPU try: if ctx1.can_access(ctx2): print("Direct peer access is possible") ctx1.enable_peer_access(ctx2) else: print("Direct peer access is not supported") except: print("Failed to check peer access capability") ``` **Args:** * ​peer (`Self`): The peer device to check for accessibility. **Returns:** True if the current device can access the peer device, False otherwise. **Raises:** If there's an error checking peer access capability. 
### `enable_peer_access`

`enable_peer_access(self, peer: Self)`

Enables direct memory access to the peer device. This method establishes peer-to-peer access from the current device to the specified peer device. Once enabled, the current device can directly read from and write to memory allocated on the peer device without going through host memory, which can significantly improve performance for multi-GPU operations.

Notes:

* It's recommended to call `can_access()` first to check if peer access is possible.
* Peer access is not always symmetric; you may need to enable access in both directions.

Example:

```mojo
from gpu.host import DeviceContext

var ctx1 = DeviceContext(0)  # First GPU
var ctx2 = DeviceContext(1)  # Second GPU

try:
    if ctx1.can_access(ctx2):
        ctx1.enable_peer_access(ctx2)
        print("Peer access enabled from device 0 to device 1")

        # For bidirectional access
        if ctx2.can_access(ctx1):
            ctx2.enable_peer_access(ctx1)
            print("Peer access enabled from device 1 to device 0")
    else:
        print("Peer access not supported between these devices")
except:
    print("Failed to enable peer access")
```

**Args:**

* peer (`Self`): The peer device to enable access to.

**Raises:** If there's an error enabling peer access or if peer access is not supported between the devices.

### `supports_multicast`

`supports_multicast(self) -> Bool`

Returns True if this device supports multicast memory mappings.

**Returns:** True if the current device supports multicast memory, False otherwise.

**Raises:** If there's an error checking multicast support.

### `number_of_devices`

`static number_of_devices(*, api: String = String(from_name[::StringSlice[::Bool())) -> Int`

Returns the number of devices available that support the specified API. This function queries the system for available devices that support the requested API (such as CUDA or HIP). It's useful for determining how many accelerators are available before allocating resources or distributing work.

Example:

```mojo
from gpu.host import DeviceContext

# Get number of CUDA devices
var num_cuda_devices = DeviceContext.number_of_devices(api="cuda")

# Get number of devices for the default API
var num_devices = DeviceContext.number_of_devices()
```

**Args:**

* api (`String`): Requested device API (for example, "cuda" or "hip"). Defaults to the device API specified by the DeviceContext class.

**Returns:** The number of available devices supporting the specified API.

---

## DeviceContextPtr

`@register_passable(trivial)`

`struct DeviceContextPtr`

Exposes a pointer to a C++ DeviceContext to Mojo. Note: When initializing a `DeviceContext` from a pointer, the refcount is not incremented. This is considered safe because `get_device_context()` is only used within kernels and the `DeviceContext` lifetime is managed by the graph compiler.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__() -> Self`

Initialize an empty `DeviceContextPtr` with a null pointer. This creates a `DeviceContextPtr` that doesn't point to any device context.

`@implicit`
`__init__(handle: UnsafePointer[NoneType]) -> Self`

Initialize a `DeviceContextPtr` from a raw pointer.

**Args:**

* handle (`UnsafePointer[NoneType]`): A raw pointer to a C++ `DeviceContext`.

`@implicit`
`__init__(device: DeviceContext) -> Self`

Initialize a `DeviceContextPtr` from a `DeviceContext`. This constructor allows implicit conversion from `DeviceContext` to `DeviceContextPtr`.

**Args:**

* device (`DeviceContext`): The `DeviceContext` to wrap in this pointer.

### `__getitem__`

`__getitem__(self) -> DeviceContext`

Dereference the pointer to get the `DeviceContext`.

**Returns:** The `DeviceContext` that this pointer points to.

### `get_device_context`

`get_device_context(self) -> DeviceContext`

Get the `DeviceContext` that this pointer points to. This is an alias for the dereference operator.

**Returns:** The `DeviceContext` that this pointer points to.
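Since a `DeviceContextPtr` is what a custom operation's `execute()` method receives for GPU support, a typical pattern is to unwrap it into a `DeviceContext` before issuing device calls. A minimal sketch (the function name is hypothetical):

```mojo
from gpu.host import DeviceContext
from runtime.asyncrt import DeviceContextPtr

fn finish_device_work(ctx_ptr: DeviceContextPtr) raises:
    # Dereference the opaque pointer to get a usable DeviceContext.
    var ctx = ctx_ptr.get_device_context()  # equivalent to ctx_ptr[]
    # Block until all work queued on this device has completed.
    ctx.synchronize()
```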
---

## DeviceContextPtrList

`@register_passable(trivial)`

`struct DeviceContextPtrList[size: Int]`

A fixed-size collection of `DeviceContextPtr` objects. This struct provides a lightweight, register-passable container for a fixed number of `DeviceContextPtr` objects, with array-like access semantics.

## Parameters

* size (`Int`): The fixed number of `DeviceContextPtr` objects in the collection.

## Fields

* ptrs (`StaticTuple[DeviceContextPtr, size]`): The underlying storage for the device context pointers.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(ptrs: StaticTuple[DeviceContextPtr, size]) -> Self`

Initialize with a `StaticTuple` of `DeviceContextPtr` objects.

**Args:**

* ptrs (`StaticTuple[DeviceContextPtr, size]`): A `StaticTuple` containing the `DeviceContextPtr` objects to store.

### `__getitem__`

`__getitem__[index: Int](self) -> DeviceContext`

Access a `DeviceContext` at a compile-time known index.

**Parameters:**

* index (`Int`): A compile-time integer index.

**Returns:** The `DeviceContext` at the specified index.

`__getitem__[I: Indexer, //](self, idx: I) -> DeviceContext`

Access a `DeviceContext` using a runtime index value.

**Parameters:**

* I (`Indexer`): A type that conforms to the `Indexer` trait.

**Args:**

* idx (`I`): A runtime index value that conforms to the `Indexer` trait.

**Returns:** The `DeviceContext` at the specified index.

### `__len__`

`__len__(self) -> Int`

Get the number of `DeviceContextPtr` objects in the collection.

**Returns:** The size of the collection as specified by the size parameter.

---

## DeviceExternalFunction

`struct DeviceExternalFunction`

Represents an external device function loaded from PTX/SASS assembly. This struct provides functionality to load and execute pre-compiled GPU functions from assembly code rather than compiling them from Mojo source. This is useful for integrating with existing CUDA/HIP code or for using specialized assembly optimizations. The `DeviceExternalFunction` handles reference counting of the underlying device function handle and provides methods for launching the function on a GPU with specified execution configuration.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Creates a copy of an existing device function by incrementing its reference count.

**Args:**

* existing (`Self`): The device function to copy.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Moves an existing device function into this one.

**Args:**

* existing (`Self`): The device function to move from.

### `__del__`

`__del__(owned self)`

Releases resources associated with this device function.

### `get_attribute`

`get_attribute(self, attr: Attribute) -> Int`

Retrieves a specific attribute of this device function.

**Args:**

* attr (`Attribute`): The attribute to query.

**Returns:** The value of the requested attribute.

**Raises:** If the attribute query fails.
--- ## DeviceFunction `struct DeviceFunction[func_type: AnyTrivialRegType, //, func: func_type, declared_arg_types: Optional[Variadic[AnyType]], *, target: target = _get_gpu_target[::StringSlice[::Bool(), _ptxas_info_verbose: Bool = False]` Represents a compiled device function for GPU execution. This struct encapsulates a compiled GPU function that can be launched on a device. It handles the compilation, loading, and resource management of device functions. Example: ```mojo from gpu.host import DeviceContext, DeviceFunction fn my_kernel(x: Int, y: Int): # Kernel implementation pass var ctx = DeviceContext() var kernel = ctx.compile_function[my_kernel]() ctx.enqueue_function(kernel, grid_dim=(1,1,1), block_dim=(32,1,1)) ``` ## Parameters * ​func\_type (`AnyTrivialRegType`): The type of the function to compile. * ​func (`func_type`): The function to compile for GPU execution. * ​declared\_arg\_types (`Optional[Variadic[AnyType]]`): An optional containing a variadic of the declared types of the kernel signature. * ​target (`target`): The target architecture for compilation. Defaults to the current GPU target. * ​\_ptxas\_info\_verbose (`Bool`): Whether to enable verbose PTX assembly output. Defaults to False. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a copy of an existing DeviceFunction. This increases the reference count of the underlying device function handle. **Args:** * ​existing (`Self`): The DeviceFunction to copy from. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Moves an existing DeviceFunction into this one. **Args:** * ​existing (`Self`): The DeviceFunction to move from. ### `__del__` `__del__(owned self)` Releases resources associated with this DeviceFunction. This decrements the reference count of the underlying device function handle. ### `dump_rep` `dump_rep[dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False), _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path] = __init__[::Copyable & ::Movable](False)](self)` Dumps various representations of the compiled device function. This method dumps the assembly, LLVM IR, and/or SASS code for the compiled device function based on the provided parameters. The output can be directed to stdout or written to files. Notes: When a path contains '%', it will be replaced with the module name to help disambiguate multiple kernel dumps. **Parameters:** * ​dump\_asm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of assembly code. Can be a boolean, a file path, or a function returning a file path. * ​dump\_llvm (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of LLVM IR. Can be a boolean, a file path, or a function returning a file path. * ​\_dump\_sass (`Variant[Bool, Path, StringSlice[StaticConstantOrigin], fn() capturing -> Path]`): Controls dumping of SASS code (internal use). Can be a boolean, a file path, or a function returning a file path. **Raises:** If any file operations fail during the dumping process. ### `get_attribute` `get_attribute(self, attr: Attribute) -> Int` Retrieves a specific attribute value from the compiled device function. 
This method queries the device function for information about its resource requirements, execution capabilities, or other properties defined by the specified attribute.

Example:

```mojo
from gpu.host import Attribute, DeviceFunction

var device_function = DeviceFunction(...)

# Get the maximum number of threads per block for this function
var max_threads = device_function.get_attribute(Attribute.MAX_THREADS_PER_BLOCK)
```

**Args:**

* attr (`Attribute`): The attribute to query, defined in the `Attribute` enum.

**Returns:** The integer value of the requested attribute.

**Raises:** If the attribute query fails or the attribute is not supported.

---

## DeviceMulticastBuffer

`struct DeviceMulticastBuffer[type: DType]`

Represents a multicast memory object that enables special memory operations to be broadcast across a group of devices.

## Parameters

* type (`DType`): Data type to be stored in the associated memory regions.

## Implemented traits

`AnyType`, `UnknownDestructibility`

---

## DevicePassable

This trait marks types as passable to accelerator devices.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Aliases

### `device_type`

`alias device_type`

Indicates the type being used on accelerator devices.

## Methods

### `get_type_name`

`static get_type_name() -> String`

Gets the name of the host type (the one implementing this trait). For example, Int would return "Int", DeviceBuffer\[DType.float32] would return "DeviceBuffer\[DType.float32]". This is used for error messages when passing types to the device. TODO: This method will be retired soon when better kernel call error messages arrive.

**Returns:** The host type's name.

### `get_device_type_name`

`static get_device_type_name() -> String`

Gets device\_type's name. For example, because DeviceBuffer's device\_type is UnsafePointer, DeviceBuffer\[DType.float32]'s get\_device\_type\_name() should return something like "UnsafePointer\[Scalar\[DType.float32]]". This is used for error messages when passing types to the device. TODO: This method will be retired soon when better kernel call error messages arrive.

**Returns:** The device type's name.

---

## DeviceStream

`struct DeviceStream`

Represents a CUDA/HIP stream for asynchronous GPU operations. A DeviceStream provides a queue for GPU operations that can execute concurrently with operations in other streams. Operations within a single stream execute in the order they are issued, but operations in different streams may execute in any relative order or concurrently. This abstraction allows for better utilization of GPU resources by enabling overlapping of computation and data transfers.

Example:

```mojo
from gpu.host import DeviceContext, DeviceStream

var ctx = DeviceContext(0)  # Select first GPU
var stream = DeviceStream(ctx)

# Launch operations on the stream
# ...

# Wait for all operations in the stream to complete
stream.synchronize()
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `synchronize`

`synchronize(self)`

Blocks the calling CPU thread until all operations in this stream complete. This function waits until all previously issued commands in this stream have completed execution. It provides a synchronization point between host and device code.

Example:

```mojo
# Launch kernel or memory operations on the stream
# ...

# Wait for completion
stream.synchronize()

# Now it's safe to use results on the host
```

**Raises:** If synchronization fails.

---

## dict

Defines `Dict`, a collection that stores key-value pairs.
Dict provides an efficient, O(1) amortized average-time complexity for insert, lookup, and removal of dictionary elements. Its implementation closely mirrors Python's `dict` implementation:

* Performance and size are heavily optimized for small dictionaries, but can scale to large dictionaries.
* Insertion order is implicitly preserved. Iteration over keys, values, and items has a deterministic order based on insertion.
* For more information on the Mojo `Dict` type, see the [Mojo `Dict` manual](/mojo/manual/types/#dict). To learn more about using Python dictionaries from Mojo, see [Python types in Mojo](/mojo/manual/python/types/#python-types-in-mojo).

Key elements must implement the `KeyElement` trait, which encompasses Movable, Hashable, and EqualityComparable. It also includes Copyable and Movable until we push references through the standard library types. Value elements must be CollectionElements for a similar reason. Both key and value types must always be Movable so we can resize the dictionary as it grows. See the `Dict` docs for more details.

## Structs

* [​`Dict`](/mojo/stdlib/collections/dict/Dict): A container that stores key-value pairs.
* [​`DictEntry`](/mojo/stdlib/collections/dict/DictEntry): Store a key-value pair entry inside a dictionary.
* [​`OwnedKwargsDict`](/mojo/stdlib/collections/dict/OwnedKwargsDict): Container used to pass owned variadic keyword arguments to functions.

## Traits

* [​`KeyElement`](/mojo/stdlib/collections/dict/KeyElement): A trait composition for types which implement all requirements of dictionary keys. Dict keys must minimally be Copyable, Movable, Hashable, and EqualityComparable for a hash map. Until we have references they must also be copyable.

---

## Dict

`struct Dict[K: KeyElement, V: Copyable & Movable]`

A container that stores key-value pairs. The key type and value type must be specified statically, unlike a Python dictionary, which can accept arbitrary key and value types. The key type must implement the `KeyElement` trait, which encompasses `Movable`, `Hashable`, and `EqualityComparable`. It also includes `Copyable` and `Movable` until we have references. The value type must implement the `Copyable` and `Movable` traits.

Examples:

```mojo
var d = Dict[String, Int]()
d["a"] = 1
d["b"] = 2
print(len(d))      # prints 2
print(d["a"])      # prints 1
print(d.pop("b"))  # prints 2
print(len(d))      # prints 1
```

For more information on the Mojo `Dict` type, see the [Mojo `Dict` manual](/mojo/manual/types/#dict). To learn more about using Python dictionaries from Mojo, see [Python types in Mojo](/mojo/manual/python/types/#python-types-in-mojo).

## Parameters

* K (`KeyElement`): The type of the dictionary key. Must be `Hashable` and `EqualityComparable` so we can find the key in the map.
* V (`Copyable & Movable`): The value type of the dictionary. Currently must be Copyable & Movable.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Aliases

### `EMPTY`

`alias EMPTY = -1`

### `REMOVED`

`alias REMOVED = -2`

## Methods

### `__init__`

`__init__(out self)`

Initialize an empty dictionary.

`__init__(out self, *, power_of_two_initial_capacity: Int)`

Initialize an empty dictionary with a pre-reserved initial capacity.

Examples:

```mojo
var x = Dict[Int, Int](power_of_two_initial_capacity = 1024)
# Insert (2/3 of 1024) entries without reallocation.
```

**Args:**

* power\_of\_two\_initial\_capacity (`Int`): At least 8, has to be a power of two.
`__init__(out self, owned keys: List[K], owned values: List[V], __dict_literal__: Tuple[])`

Constructs a dictionary from the given keys and values.

**Args:**

* keys (`List[K]`): The list of keys to build the dictionary with.
* values (`List[V]`): The corresponding values to pair with the keys.
* \_\_dict\_literal\_\_ (`Tuple[]`): Tell Mojo to use this method for dict literals.

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Copy an existing dictionary.

**Args:**

* existing (`Self`): The existing dict.

### `__bool__`

`__bool__(self) -> Bool`

Check if the dictionary is empty or not.

**Returns:** `False` if the dictionary is empty, `True` if there is at least one element.

### `__getitem__`

`__getitem__(self, key: K) -> ref [*[0,0]._entries._value.value] V`

Retrieve a value out of the dictionary.

**Args:**

* key (`K`): The key to retrieve.

**Returns:** The value associated with the key, if it's present.

**Raises:** "KeyError" if the key isn't present.

### `__setitem__`

`__setitem__(mut self, owned key: K, owned value: V)`

Set a value in the dictionary by key.

**Args:**

* key (`K`): The key to associate with the specified value.
* value (`V`): The data to store in the dictionary.

### `__contains__`

`__contains__(self, key: K) -> Bool`

Check if a given key is in the dictionary or not.

**Args:**

* key (`K`): The key to check.

**Returns:** True if the key exists in the dictionary, False otherwise.

### `__or__`

`__or__(self, other: Self) -> Self`

Merge self with other and return the result as a new dict.

**Args:**

* other (`Self`): The dictionary to merge with.

**Returns:** The result of the merge.

### `__ior__`

`__ior__(mut self, other: Self)`

Merge self with other in place.

**Args:**

* other (`Self`): The dictionary to merge with.

### `copy`

`copy(self) -> Self`

Copy an existing dictionary.

**Returns:** A copy of the value.

### `fromkeys`

`static fromkeys(keys: List[K, hint_trivial_type], value: V) -> Self`

Create a new dictionary with keys from list and values set to value.

**Args:**

* keys (`List[K, hint_trivial_type]`): The keys to set.
* value (`V`): The value to set.

**Returns:** The new dictionary.

`static fromkeys(keys: List[K, hint_trivial_type], value: Optional[V] = Optional(None)) -> Dict[K, Optional[V]]`

Create a new dictionary with keys from list and values set to value.

**Args:**

* keys (`List[K, hint_trivial_type]`): The keys to set.
* value (`Optional[V]`): The value to set.

**Returns:** The new dictionary.

### `__iter__`

`__iter__(ref self) -> _DictKeyIter[K, V, self_is_origin]`

Iterate over the dict's keys as immutable references.

**Returns:** An iterator of immutable references to the dictionary keys.

### `__reversed__`

`__reversed__(ref self) -> _DictKeyIter[K, V, self_is_origin, False]`

Iterate backwards over the dict keys, returning immutable references.

**Returns:** A reversed iterator of immutable references to the dict keys.

### `__len__`

`__len__(self) -> Int`

The number of elements currently stored in the dictionary.

**Returns:** The number of elements currently stored in the dictionary.

### `__str__`

`__str__[T: KeyElement & Representable, U: Copyable & Movable & Representable, //](self: Dict[T, U]) -> String`

Returns a string representation of a `Dict`.

Notes: Since we can't condition methods on a trait yet, the way to call this method is a bit special.
For example:

```mojo
var my_dict = Dict[Int, Float64]()
my_dict[1] = 1.1
my_dict[2] = 2.2
dict_as_string = my_dict.__str__()
print(dict_as_string)  # prints "{1: 1.1, 2: 2.2}"
```

When the compiler supports conditional methods, then a simple `String(my_dict)` will be enough.

**Parameters:**

* T (`KeyElement & Representable`): The type of the keys in the Dict. Must implement the traits `Representable` and `KeyElement`.
* U (`Copyable & Movable & Representable`): The type of the values in the Dict. Must implement the traits `Representable`, `Copyable` and `Movable`.

**Returns:** A string representation of the Dict.

### `find`

`find(self, key: K) -> Optional[V]`

Find a value in the dictionary by key.

**Args:**

* key (`K`): The key to search for in the dictionary.

**Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional.

### `get`

`get(self, key: K) -> Optional[V]`

Get a value from the dictionary by key.

**Args:**

* key (`K`): The key to search for in the dictionary.

**Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional.

`get(self, key: K, default: V) -> V`

Get a value from the dictionary by key.

**Args:**

* key (`K`): The key to search for in the dictionary.
* default (`V`): Default value to return.

**Returns:** A copy of the value if it was present, otherwise default.

### `pop`

`pop(mut self, key: K, owned default: V) -> V`

Remove a value from the dictionary by key.

**Args:**

* key (`K`): The key to remove from the dictionary.
* default (`V`): A default value to return if the key was not found instead of raising.

**Returns:** The value associated with the key, if it was in the dictionary. If it wasn't, return the provided default value instead.

`pop(mut self, key: K) -> V`

Remove a value from the dictionary by key.

**Args:**

* key (`K`): The key to remove from the dictionary.

**Returns:** The value associated with the key, if it was in the dictionary. Raises otherwise.

**Raises:** "KeyError" if the key was not present in the dictionary.

### `popitem`

`popitem(mut self) -> DictEntry[K, V]`

Remove and return a (key, value) pair from the dictionary.

Notes: Pairs are returned in LIFO order. popitem() is useful to destructively iterate over a dictionary, as often used in set algorithms. If the dictionary is empty, calling popitem() raises a KeyError.

**Returns:** The last dictionary item.

**Raises:** "KeyError" if the dictionary is empty.

### `keys`

`keys(ref self) -> _DictKeyIter[K, V, self_is_origin]`

Iterate over the dict's keys as immutable references.

**Returns:** An iterator of immutable references to the dictionary keys.

### `values`

`values(ref self) -> _DictValueIter[K, V, self_is_origin]`

Iterate over the dict's values as references.

**Returns:** An iterator of references to the dictionary values.

### `items`

`items(ref self) -> _DictEntryIter[K, V, self_is_origin]`

Iterate over the dict's entries as immutable references.

Examples:

```mojo
var my_dict = Dict[String, Int]()
my_dict["a"] = 1
my_dict["b"] = 2

for e in my_dict.items():
    print(e[].key, e[].value)
```

Notes: These can't yet be unpacked like Python dict items, but you can access the key and value as attributes.

**Returns:** An iterator of immutable references to the dictionary entries.

### `update`

`update(mut self, other: Self, /)`

Update the dictionary with the key/value pairs from other, overwriting existing keys.

Notes: The argument must be positional only.

**Args:**

* other (`Self`): The dictionary to update from.
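The following sketch ties several of the lookup and update methods above together (names and values are illustrative):

```mojo
from collections import Dict

def main():
    var counts = Dict[String, Int]()
    counts["a"] = 1
    # get() with a default never raises on a missing key.
    print(counts.get("b", 0))  # 0
    # setdefault() inserts the default only if the key is absent.
    _ = counts.setdefault("b", 0)
    var other = Dict[String, Int]()
    other["b"] = 5
    counts.update(other)  # overwrites existing keys
    print(counts["b"])  # 5
```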
### `clear`

`clear(mut self)`

Remove all elements from the dictionary.

### `setdefault`

`setdefault(mut self, key: K, owned default: V) -> ref [*[0,0]._entries._value.value] V`

Get a value from the dictionary by key, or set it to a default if it doesn't exist.

**Args:**

* key (`K`): The key to search for in the dictionary.
* default (`V`): The default value to set if the key is not present.

**Returns:** The value associated with the key, or the default value if it wasn't present.

---

## DictEntry

`struct DictEntry[K: KeyElement, V: Copyable & Movable]`

Store a key-value pair entry inside a dictionary.

## Parameters

* K (`KeyElement`): The key type of the dict. Must be Hashable+EqualityComparable.
* V (`Copyable & Movable`): The value type of the dict.

## Fields

* hash (`SIMD[uint64, 1]`): `key.__hash__()`, stored so hashing isn't re-computed during dict lookup.
* key (`K`): The unique key for the entry.
* value (`V`): The value associated with the key.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, owned key: K, owned value: V)`

Create an entry from a key and value, computing the hash.

**Args:**

* key (`K`): The key of the entry.
* value (`V`): The value of the entry.

### `copy`

`copy(self) -> Self`

Copy an existing entry.

**Returns:** A copy of the value.

### `reap_value`

`reap_value(owned self, out result: V)`

Take the value from an owned entry.

**Returns:** The value of the entry.

---

## dim

This module implements the dim type.

## Structs

* [​`Dim`](/mojo/stdlib/gpu/host/dim/Dim): Represents a dimension with up to three components (x, y, z).

---

## Dim

`@register_passable(trivial)`

`struct Dim`

A static or dynamic dimension modeled with an optional integer. This struct is meant to represent an optional static dimension. When a value is present, the dimension has that static value. When a value is not present, the dimension is dynamic.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `ImplicitlyBoolable`, `Indexer`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`@implicit`
`__init__[I: Intable](value: I) -> Self`

Creates a statically-known dimension.

**Parameters:**

* I (`Intable`): The Intable type.

**Args:**

* value (`I`): The static dimension value.

`@implicit`
`__init__[I: Indexer](value: I) -> Self`

Creates a statically-known dimension.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* value (`I`): The static dimension value.

`@implicit`
`__init__(value: index) -> Self`

Creates a statically-known dimension.

**Args:**

* value (`index`): The static dimension value.

`@implicit`
`__init__(value: Int) -> Self`

Creates a statically-known dimension.

**Args:**

* value (`Int`): The static dimension value.

`__init__() -> Self`

Creates a dynamic dimension with no static value.

### `__bool__`

`__bool__(self) -> Bool`

Returns True if the dimension has a static value.

**Returns:** Whether the dimension has a static value.

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

Compares two dimensions for equality.

**Args:**

* rhs (`Self`): The other dimension.

**Returns:** True if the dimensions are the same.

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

Compare two dimensions for inequality.

**Args:**

* rhs (`Self`): The dimension to compare.

**Returns:** True if they are not equal.

### `__mul__`

`__mul__(self, rhs: Self) -> Self`

Multiplies two dimensions. If either is unknown, the result is unknown as well.

**Args:**

* rhs (`Self`): The other dimension.

**Returns:** The product of the two dimensions.

### `__floordiv__`

`__floordiv__(self, rhs: Self) -> Self`

Divide by the given dimension and round towards negative infinity. If either is unknown, the result is unknown as well.

**Args:**

* rhs (`Self`): The divisor dimension.

**Returns:** The floor division of the two dimensions.

### `__rfloordiv__`

`__rfloordiv__(self, rhs: Self) -> Self`

Divide the given argument by self and round towards negative infinity. If either is unknown, the result is unknown as well.

**Args:**

* rhs (`Self`): The dimension to divide by this Dim.

**Returns:** The floor of the argument divided by self.

### `__imul__`

`__imul__(mut self, rhs: Self)`

Inplace multiplies two dimensions. If either is unknown, the result is unknown as well.

**Args:**

* rhs (`Self`): The other dimension.

### `__as_bool__`

`__as_bool__(self) -> Bool`

Returns True if the dimension has a static value.

**Returns:** Whether the dimension has a static value.

### `has_value`

`has_value(self) -> Bool`

Returns True if the dimension has a static value.

**Returns:** Whether the dimension has a static value.

### `is_dynamic`

`is_dynamic(self) -> Bool`

Returns True if the dimension has a dynamic value.

**Returns:** Whether the dimension is dynamic.

### `get`

`get(self) -> Int`

Gets the static dimension value.

**Returns:** The static dimension value.

### `is_multiple`

`is_multiple[alignment: Int](self) -> Bool`

Checks if the dimension is aligned.

**Parameters:**

* alignment (`Int`): The alignment requirement.

**Returns:** Whether the dimension is aligned.

### `__index__`

`__index__(self) -> index`

Convert to index.

**Returns:** The corresponding \_\_mlir\_type.index value.

### `__int__`

`__int__(self) -> Int`

Gets the static dimension value.

**Returns:** The static dimension value.

### `__str__`

`__str__(self) -> String`

Converts the Dim to a String. If the value is unknown, then the string "?" is returned.

**Returns:** The string representation of the type.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this Dim to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.

**Args:**

* writer (`W`): The object to write to.

### `or_else`

`or_else(self, default: Int) -> Int`

Returns the static dimension value, or a default value if the dimension has no static value.

**Args:**

* default (`Int`): The value to use if no static value is present.

**Returns:** The static dimension value, or the default.
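A short sketch of how static and dynamic `Dim` values behave, based on the `__init__`, `__mul__`, and `__str__` semantics described above:

```mojo
from buffer import Dim

fn main():
    var static_dim = Dim(16)  # statically-known dimension
    var dynamic_dim = Dim()   # dynamic dimension, no static value
    print(static_dim)         # 16
    print(dynamic_dim)        # ?
    # Multiplication propagates "unknown": if either side is dynamic,
    # the product is dynamic as well.
    print(static_dim * dynamic_dim)  # ?
```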
---

## Dim

`@register_passable(trivial)`

`struct Dim`

Represents a dimension with up to three components (x, y, z). This struct is commonly used to represent grid and block dimensions for kernel launches.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`@implicit`
`__init__[I: Indexer](x: I) -> Self`

Initializes Dim with a single indexable value for x. The y and z dimensions are set to 1.

**Parameters:**

* I (`Indexer`): The type of the indexable value.

**Args:**

* x (`I`): The value for the x dimension.

`__init__[I0: Indexer, I1: Indexer](x: I0, y: I1) -> Self`

Initializes Dim with indexable values for x and y. The z dimension is set to 1.

**Parameters:**

* I0 (`Indexer`): The type of the first indexable value.
* ​I1 (`Indexer`): The type of the second indexable value. **Args:** * ​x (`I0`): The value for the x dimension. * ​y (`I1`): The value for the y dimension. `__init__[I0: Indexer, I1: Indexer, I2: Indexer](x: I0, y: I1, z: I2) -> Self` Initializes Dim with indexable values for x, y, and z. **Parameters:** * ​I0 (`Indexer`): The type of the first indexable value. * ​I1 (`Indexer`): The type of the second indexable value. * ​I2 (`Indexer`): The type of the third indexable value. **Args:** * ​x (`I0`): The value for the x dimension. * ​y (`I1`): The value for the y dimension. * ​z (`I2`): The value for the z dimension. `@implicit` `__init__[I: Indexer](dims: Tuple[I]) -> Self` Initializes Dim with a tuple containing a single indexable value. y and z dimensions are set to 1. **Parameters:** * ​I (`Indexer`): The type of the indexable value in the tuple. **Args:** * ​dims (`Tuple[I]`): A tuple with one element for x dimension. `@implicit` `__init__[I0: Indexer, I1: Indexer](dims: Tuple[I0, I1]) -> Self` Initializes Dim with a tuple of two indexable values. The z dimension is set to 1. **Parameters:** * ​I0 (`Indexer`): The type of the first indexable value in the tuple. * ​I1 (`Indexer`): The type of the second indexable value in the tuple. **Args:** * ​dims (`Tuple[I0, I1]`): A tuple with two elements: x and y dimensions. `@implicit` `__init__[I0: Indexer, I1: Indexer, I2: Indexer](dims: Tuple[I0, I1, I2]) -> Self` Initializes Dim with a tuple of three indexable values. **Parameters:** * ​I0 (`Indexer`): The type of the first indexable value in the tuple. * ​I1 (`Indexer`): The type of the second indexable value in the tuple. * ​I2 (`Indexer`): The type of the third indexable value in the tuple. **Args:** * ​dims (`Tuple[I0, I1, I2]`): Tuple with three elements: x, y, and z dimensions. ### `__getitem__` `__getitem__(self, idx: Int) -> Int` Gets the dimension value at the specified index. **Args:** * ​idx (`Int`): The index (0 for x, 1 for y, 2 for z). **Returns:** The value of the dimension at the given index. ### `__str__` `__str__(self) -> String` Returns a string representation of the Dim. **Returns:** String representation of this Dim object. ### `__repr__` `__repr__(self) -> String` Returns a string representation of the Dim. **Returns:** String representation of this Dim object. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a formatted string representation of the Dim. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): The Writer to write to. ### `z` `z(self) -> Int` Returns the z dimension. **Returns:** The value of the z dimension. ### `y` `y(self) -> Int` Returns the y dimension. **Returns:** The value of the y dimension. ### `x` `x(self) -> Int` Returns the x dimension. **Returns:** The value of the x dimension. --- ## dimlist Provides utilities for working with static and variadic lists. You can import these APIs from the `buffer` package. For example: ```mojo from buffer import Dim ``` ## Structs * [​`Dim`](/mojo/stdlib/buffer/dimlist/Dim): A static or dynamic dimension modeled with an optional integer. * [​`DimList`](/mojo/stdlib/buffer/dimlist/DimList): This type represents a list of dimensions. Each dimension may have a static value or not have a value, which represents a dynamic dimension. --- ## DimList `@register_passable(trivial)` `struct DimList` This type represents a list of dimensions. 
Each dimension may have a static value or not have a value, which represents a dynamic dimension.

## Fields

* value (`VariadicList[Dim]`): The underlying storage for the list of dimensions.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`@implicit`
`__init__[Intable: Intable](value: Intable) -> Self`

Creates a dimension list from the given list of values.

**Parameters:**

* Intable (`Intable`): A type able to be converted to an `Int`.

**Args:**

* value (`Intable`): The initial dim values list.

`@implicit`
`__init__[I: Indexer](values: Tuple[I]) -> Self`

Creates a dimension list from the given list of values.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* values (`Tuple[I]`): The initial dim values list.

`@implicit`
`__init__[I0: Indexer, I1: Indexer](values: Tuple[I0, I1]) -> Self`

Creates a dimension list from the given list of values.

**Parameters:**

* I0 (`Indexer`): A type that can be used as an Index.
* I1 (`Indexer`): A type that can be used as an Index.

**Args:**

* values (`Tuple[I0, I1]`): The initial dim values list.

`@implicit`
`__init__[I0: Indexer, I1: Indexer, I2: Indexer](values: Tuple[I0, I1, I2]) -> Self`

Creates a dimension list from the given list of values.

**Parameters:**

* I0 (`Indexer`): A type that can be used as an Index.
* I1 (`Indexer`): A type that can be used as an Index.
* I2 (`Indexer`): A type that can be used as an Index.

**Args:**

* values (`Tuple[I0, I1, I2]`): The initial dim values list.

`__init__[I0: Indexer, I1: Indexer](val0: I0, val1: I1) -> Self`

Creates a dimension list from the given list of values.

**Parameters:**

* I0 (`Indexer`): A type that can be used as an Index.
* I1 (`Indexer`): A type that can be used as an Index.

**Args:**

* val0 (`I0`): The initial dim value.
* val1 (`I1`): The initial dim value.

`__init__[I0: Indexer, I1: Indexer, I2: Indexer](val0: I0, val1: I1, val2: I2) -> Self`

Creates a dimension list from the given list of values.

**Parameters:**

* I0 (`Indexer`): A type that can be used as an Index.
* I1 (`Indexer`): A type that can be used as an Index.
* I2 (`Indexer`): A type that can be used as an Index.

**Args:**

* val0 (`I0`): The initial dim value.
* val1 (`I1`): The initial dim value.
* val2 (`I2`): The initial dim value.

`__init__[I0: Indexer, I1: Indexer, I2: Indexer, I3: Indexer](val0: I0, val1: I1, val2: I2, val3: I3) -> Self`

Creates a dimension list from the given list of values.

**Parameters:**

* I0 (`Indexer`): A type that can be used as an Index.
* I1 (`Indexer`): A type that can be used as an Index.
* I2 (`Indexer`): A type that can be used as an Index.
* I3 (`Indexer`): A type that can be used as an Index.

**Args:**

* val0 (`I0`): The initial dim value.
* val1 (`I1`): The initial dim value.
* val2 (`I2`): The initial dim value.
* val3 (`I3`): The initial dim value.

`@implicit`
`__init__(values: VariadicList[Dim]) -> Self`

Creates a dimension list from the given list of values.

**Args:**

* values (`VariadicList[Dim]`): The initial dim values list.

`@implicit`
`__init__(*values: Dim) -> Self`

Creates a dimension list from the given Dim values.

**Args:**

* \*values (`Dim`): The initial dim values.

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

Compares two DimLists for equality. DimLists are considered equal if all non-dynamic Dims have similar values and all dynamic Dims in self are also dynamic in rhs.

**Args:**

* rhs (`Self`): The other DimList.
**Returns:** True if the DimLists are the same.

### `__len__`

`__len__(self) -> Int`

Gets the size of the DimList.

**Returns:** The number of elements in the DimList.

### `get`

`get[i: Int](self) -> Int`

Gets the static dimension value at a specified index.

**Parameters:**

* i (`Int`): The dimension index.

**Returns:** The static dimension value at the specified index.

### `at`

`at[i: Int](self) -> Dim`

Gets the dimension at a specified index.

**Parameters:**

* i (`Int`): The dimension index.

**Returns:** The dimension at the specified index.

### `has_value`

`has_value[i: Int](self) -> Bool`

Returns True if the dimension at the given index has a static value.

**Parameters:**

* i (`Int`): The dimension index.

**Returns:** Whether the specified dimension has a static value.

### `product`

`product[length: Int](self) -> Dim`

Computes the product of the first `length` dimensions in the list. If any are dynamic, the result is a dynamic dimension value.

**Parameters:**

* length (`Int`): The number of elements in the list.

**Returns:** The product of the first `length` dimensions.

`product[start: Int, end: Int](self) -> Dim`

Computes the product of a range of the dimensions in the list. If any in the range are dynamic, the result is a dynamic dimension value.

**Parameters:**

* start (`Int`): The starting index.
* end (`Int`): The end index.

**Returns:** The product of all the dimensions.

`product(self) -> Dim`

Computes the product of all the dimensions in the list. If any are dynamic, the result is a dynamic dimension value.

**Returns:** The product of all the dimensions.

### `contains`

`contains[length: Int](self, value: Dim) -> Bool`

Determines whether the dimension list contains a specified dimension value.

**Parameters:**

* length (`Int`): The number of elements in the list.

**Args:**

* value (`Dim`): The value to find.

**Returns:** True if the list contains a dimension of the specified value.

### `all_known`

`all_known[length: Int](self) -> Bool`

Determines whether all dimensions are statically known.

**Parameters:**

* length (`Int`): The number of elements in the list.

**Returns:** True if all dimensions have a static value.

`all_known[start: Int, end: Int](self) -> Bool`

Determines whether all dimensions within \[start, end) are statically known.

**Parameters:**

* start (`Int`): The first queried dimension.
* end (`Int`): The last queried dimension.

**Returns:** True if all queried dimensions have a static value.

### `into_index_list`

`into_index_list[rank: Int](self) -> IndexList[rank]`

Copy the DimList values into an `IndexList`, providing the rank.

```mojo
from buffer import DimList

var dim_list = DimList(2, 4)
var index_list = dim_list.into_index_list[rank=2]()
```

**Parameters:**

* rank (`Int`): The rank of the output IndexList.

**Returns:** An IndexList with the same dimensions as the DimList.

### `create_unknown`

`static create_unknown[length: Int]() -> Self`

Creates a dimension list of all dynamic dimension values.

**Parameters:**

* length (`Int`): The number of elements in the list.

**Returns:** A list of all dynamic dimension values.

### `__str__`

`__str__(self) -> String`

Converts the DimList to a String. The String is a comma separated list of the string representation of Dim.

**Returns:** The string representation of the type.

### `__repr__`

`__repr__(self) -> String`

Converts the DimList to a readable String representation.

**Returns:** The string representation of the type.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this DimList to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.

**Args:**

* writer (`W`): The object to write to.
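A short sketch combining static and dynamic dimensions in a `DimList`, using the `product()` and `all_known()` semantics described above:

```mojo
from buffer import Dim, DimList

fn main():
    # Mix static and dynamic dimensions.
    var dims = DimList(2, Dim(), 8)
    print(dims)                 # 2, ?, 8
    print(dims.all_known[3]())  # False
    # product() is dynamic if any dimension in the range is dynamic.
    print(dims.product[3]())    # ?
```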
---

## dirname

`dirname[PathLike: PathLike, //](path: PathLike) -> String`

Returns the directory component of a pathname.

**Parameters:**

* PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* path (`PathLike`): The path to a file.

**Returns:** The directory component of a pathname.
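For example (the path is illustrative):

```mojo
from os.path import dirname

fn main():
    var path = String("/usr/local/bin/mojo")
    print(dirname(path))  # /usr/local/bin
```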
---

## dispatch_get_kernel_type

`dispatch_get_kernel_type[: origin.set, //, func: fn[Bool]() raises capturing -> None](m: Int, n: Int, k: Int)`

`dispatch_get_kernel_type[: origin.set, //, func: fn[Bool]() capturing -> None](m: Int, n: Int, k: Int)`

---

## dispatch_mask_and_score_mod

`dispatch_mask_and_score_mod[mask_type: String, score_mod_type: String, callback_fn: fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None, local_window_size: Int = -1, num_heads: Int = -1]()`

---

## dispatch_materialized_mask_and_score_mod

`dispatch_materialized_mask_and_score_mod[score_mod_type: String, callback_fn: fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None, num_heads: Int = -1](mask_nd: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], start_pos_nd: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}))`

---

## dispatch_table_a100_gpu

## Functions

* [​`create_matmul_configs_ampere`](./create_matmul_configs_ampere):
* [​`get_dispatch_table`](./get_dispatch_table):

---

## dispatch_table_amd

## Functions

* [​`create_tile_configs`](./create_tile_configs):

---

## distributed_matmul

## Functions

* [​`matmul_allreduce`](./matmul_allreduce): Performs C = matmul(A, B^T) followed by an Out = allreduce(C) operation across multiple GPUs. Split the A or B and C matrices into `num_partitions` submatrices at dimension `partition_dim`. This way we can perform `num_partitions` independent matmul + allreduce kernels, and overlap some of the computation.

---

## distributed_transformer

## `DistributedTransformer` {#max.nn.transformer.distributed_transformer.DistributedTransformer}

> *class* max.nn.transformer.distributed\_transformer.DistributedTransformer(dim, n\_heads, layers, norm, output, embedding, kv\_params, kv\_collection\_constructor, devices, return\_logits=ReturnLogits.LAST\_TOKEN)

Transformer model consisting of TransformerBlock layers.

**Parameters:**

* **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **layers** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`DistributedTransformerBlock`](#max.nn.transformer.distributed_transformer.DistributedTransformerBlock) `]` )
* **norm** ([`DistributedRMSNorm`](../norm/rms_norm.md#max.nn.norm.rms_norm.DistributedRMSNorm) )
* **output** ([`ColumnParallelLinear`](../linear.md#max.nn.linear.ColumnParallelLinear) )
* **embedding** ([`VocabParallelEmbedding`](../embedding.md#max.nn.embedding.VocabParallelEmbedding) )
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **kv\_collection\_constructor** ([`FetchContinuousBatchingKVCacheCollection`](../kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.FetchContinuousBatchingKVCacheCollection) `|` `FetchPagedKVCacheCollection` )
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` )
* **return\_logits** ([`ReturnLogits`](transformer.md#max.nn.transformer.transformer.ReturnLogits) )

## `DistributedTransformerBlock` {#max.nn.transformer.distributed_transformer.DistributedTransformerBlock}

> *class* max.nn.transformer.distributed\_transformer.DistributedTransformerBlock(attention, mlp, attention\_norm, mlp\_norm, devices, use\_subgraph=False)

Stack of Attention, FeedForward, and RMSNorm layers.

**Parameters:**

* **attention** ([`Module`](../layer.md#max.nn.layer.Module) )
* **mlp** ([`Module`](../layer.md#max.nn.layer.Module) )
* **attention\_norm** ([`DistributedRMSNorm`](../norm/rms_norm.md#max.nn.norm.rms_norm.DistributedRMSNorm) )
* **mlp\_norm** ([`DistributedRMSNorm`](../norm/rms_norm.md#max.nn.norm.rms_norm.DistributedRMSNorm) )
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` )
* **use\_subgraph** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )

### `build_subgraph()` {#max.nn.transformer.distributed_transformer.DistributedTransformerBlock.build_subgraph}

> build\_subgraph(name)

**Parameters:** **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) )

**Return type:** [*Module*](../layer.md#max.nn.layer.Module)

## `distribute_value()` {#max.nn.transformer.distributed_transformer.distribute_value}

> max.nn.transformer.distributed\_transformer.distribute\_value(v, devices)

**Parameters:** **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` )

---

## divmod

`divmod(numerator: Int, denominator: Int) -> Tuple[Int, Int]`

Performs integer division and returns the quotient and the remainder. Currently supported only for integers. Support for more standard library types like Int8, Int16... is planned. This method calls `a.__divmod__(b)`, so the actual implementation of `divmod` should go in the `__divmod__` method of the struct of `a`.

**Args:**

* numerator (`Int`): The dividend.
* denominator (`Int`): The divisor.

**Returns:** A `Tuple` containing the quotient and the remainder.

`divmod(numerator: UInt, denominator: UInt) -> Tuple[UInt, UInt]`

Performs integer division and returns the quotient and the remainder. Currently supported only for integers. Support for more standard library types like Int8, Int16... is planned. This method calls `a.__divmod__(b)`, so the actual implementation of `divmod` should go in the `__divmod__` method of the struct of `a`.

**Args:**

* numerator (`UInt`): The dividend.
* denominator (`UInt`): The divisor.

**Returns:** A `Tuple` containing the quotient and the remainder.
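For example:

```mojo
fn main():
    # divmod returns the quotient and remainder as a Tuple.
    var qr = divmod(7, 3)
    print(qr[0], qr[1])  # 2 1
```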
---

## DLHandle

`@register_passable(trivial)`

`struct DLHandle`

Represents a dynamically linked library that can be loaded and unloaded. The library is loaded on initialization and unloaded by `close`.

## Fields

* handle (`UnsafePointer[NoneType]`): The handle to the dynamic library.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, flags: Int = (256 if os_is_linux() else 8 | 2))`

Initialize a dynamic library handle to all global symbols in the current process. On POSIX-compatible operating systems, this performs `dlopen(nullptr, flags)`.

**Args:**

* flags (`Int`): The flags to load the dynamic library.

`__init__[PathLike: PathLike, //](out self, path: PathLike, flags: Int = (256 if os_is_linux() else 8 | 2))`

Initialize a DLHandle object by loading the dynamic library at the given path.

**Parameters:**

* PathLike (`PathLike`): The type conforming to the `os.PathLike` trait.

**Args:**

* path (`PathLike`): The path to the dynamic library file.
* flags (`Int`): The flags to load the dynamic library.

### `__bool__`

`__bool__(self) -> Bool`

Checks if the handle is valid.

**Returns:** True if the DLHandle is not null and False otherwise.

### `copy`

`copy(self) -> Self`

Copy the object.

**Returns:** A copy of the value.

### `check_symbol`

`check_symbol(self, owned name: String) -> Bool`

Check that the symbol exists in the dynamic library.

**Args:**

* name (`String`): The symbol to check.

**Returns:** `True` if the symbol exists.

### `close`

`close(mut self)`

Delete the DLHandle object, unloading the associated dynamic library.

### `get_function`

`get_function[result_type: AnyTrivialRegType](self, owned name: String) -> result_type`

Returns a handle to the function with the given name in the dynamic library.

**Parameters:**

* result\_type (`AnyTrivialRegType`): The type of the function pointer to return.

**Args:**

* name (`String`): The name of the function to get the handle for.

**Returns:** A handle to the function.

### `get_symbol`

`get_symbol[result_type: AnyType](self, name: StringSlice[origin]) -> UnsafePointer[result_type]`

Returns a pointer to the symbol with the given name in the dynamic library.

**Parameters:**

* result\_type (`AnyType`): The type of the symbol to return.

**Args:**

* name (`StringSlice[origin]`): The name of the symbol to get the handle for.

**Returns:** A pointer to the symbol.

`get_symbol[result_type: AnyType](self, *, cstr_name: UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> UnsafePointer[result_type]`

Returns a pointer to the symbol with the given name in the dynamic library.

**Parameters:**

* result\_type (`AnyType`): The type of the symbol to return.

**Args:**

* cstr\_name (`UnsafePointer[SIMD[int8, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The name of the symbol to get the handle for.

**Returns:** A pointer to the symbol.

### `call`

`call[name: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType = NoneType, *T: AnyType = *?](self, *args: *T) -> return_type`

Call a function with any number of arguments.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the function.
* return\_type (`AnyTrivialRegType`): The return type of the function.
* \*T (`AnyType`): The types of `args`.

**Args:**

* \*args (`*T`): The arguments.

**Returns:** The result.

`call[name: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType = NoneType](self, args: VariadicPack[is_owned, origin, AnyType, element_types]) -> return_type`

Call a function with any number of arguments.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the function.
* return\_type (`AnyTrivialRegType`): The return type of the function.

**Args:**

* args (`VariadicPack[is_owned, origin, AnyType, element_types]`): The arguments.

**Returns:** The result.
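A minimal sketch of loading a shared library and calling a symbol through `get_function()`. The library name is Linux-specific and illustrative:

```mojo
from sys.ffi import DLHandle

fn main():
    var handle = DLHandle("libm.so.6")
    if handle.check_symbol("cos"):
        # Look up `cos` as a C function pointer: fn(Float64) -> Float64.
        var cos_fn = handle.get_function[fn (Float64) -> Float64]("cos")
        print(cos_fn(0.0))  # 1.0
    handle.close()
```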
---

## dot_at_b

`dot_at_b(c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive])`

---

## dot_at_b_impl

`dot_at_b_impl(c: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))], a: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))], b: NDBuffer[float32, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(16, 16)))])`

`dot_at_b_impl(c: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))], a: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))], b: NDBuffer[float16, 2, origin, __init__[::Indexer,::Indexer](Tuple(VariadicPack(32, 32)))])`

---

## dot_i16_to_i32_AVX2

`dot_i16_to_i32_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]`

The dot product of the two words in each int32 element of a and b plus an int32 from src.

**Constraints:**

Requires AVX2. The size of the output vector must be 4, 8 or 16.

**Parameters:**

* width (`Int`): Size of the output SIMD vector.
* a\_type (`DType`): The DType for a.
* b\_type (`DType`): The DType for b.
* c\_type (`DType`): The DType for c.

**Args:**

* src (`SIMD[c_type, width]`): An int32 SIMD vector.
* a (`SIMD[a_type, width]`): An int16 SIMD vector.
* b (`SIMD[b_type, width]`): An int16 SIMD vector.

**Returns:**

A SIMD vector of width elements.

---

## dot_i16_to_i32_x86

`dot_i16_to_i32_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]`

The dot product of the two words in each int32 element of a and b plus an int32 from src using VNNI or AVX2.

**Constraints:**

Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16.

**Parameters:**

* width (`Int`): Size of the output SIMD vector.
* a\_type (`DType`): The DType for a.
* b\_type (`DType`): The DType for b.
* c\_type (`DType`): The DType for c.

**Args:**

* src (`SIMD[c_type, width]`): An int32 SIMD vector.
* a (`SIMD[a_type, width]`): An int16 SIMD vector.
* b (`SIMD[b_type, width]`): An int16 SIMD vector.

**Returns:**

A SIMD vector of width elements.

---

## dot_i8_to_i32_AVX2

`dot_i8_to_i32_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]`

The dot product of the four bytes in each int32 element of a and b plus an int32 from src.

**Constraints:**

Requires AVX2. The size of the output vector must be 4, 8 or 16.

The a argument has range \[0,255]. The b argument has range \[-128,127].

**Parameters:**

* width (`Int`): Size of the output SIMD vector.
* a\_type (`DType`): The DType for a.
* b\_type (`DType`): The DType for b.
* c\_type (`DType`): The DType for c.

**Args:**

* src (`SIMD[c_type, width]`): An int32 SIMD vector.
* a (`SIMD[a_type, width]`): A uint8 SIMD vector.
* b (`SIMD[b_type, width]`): An int8 SIMD vector.

**Returns:**

A SIMD vector of width elements.
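Conceptually, each 32-bit lane of the result adds a four-byte dot product to the corresponding accumulator lane from `src`. A scalar sketch of what a single output lane computes (an illustration of the semantics, not the intrinsic itself):

```mojo
fn one_lane(src: Int32, a: SIMD[DType.uint8, 4], b: SIMD[DType.int8, 4]) -> Int32:
    # acc = src + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
    var acc = src
    for i in range(4):
        acc += a[i].cast[DType.int32]() * b[i].cast[DType.int32]()
    return acc
```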
---

## dot_i8_to_i32_saturated_AVX2

`dot_i8_to_i32_saturated_AVX2[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]`

The dot product of the four bytes in each int32 element of a and b plus an int32 from src.

**Constraints:**

Requires AVX2. The size of the output vector must be 4, 8 or 16.

The a argument has range \[0,127] not \[0, 255]. The b argument has range \[-128,127].

**Parameters:**

* width (`Int`): Size of the output SIMD vector.
* a\_type (`DType`): The DType for a.
* b\_type (`DType`): The DType for b.
* c\_type (`DType`): The DType for c.

**Args:**

* src (`SIMD[c_type, width]`): An int32 SIMD vector.
* a (`SIMD[a_type, width]`): A uint8 SIMD vector.
* b (`SIMD[b_type, width]`): An int8 SIMD vector.

**Returns:**

A SIMD vector of width elements.

---

## dot_i8_to_i32_saturated_x86

`dot_i8_to_i32_saturated_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]`

The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2.

**Constraints:**

Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16.

The a argument has range \[0,127] not \[0, 255]. The b argument has range \[-128,127].

**Parameters:**

* width (`Int`): Size of the output SIMD vector.
* a\_type (`DType`): The DType for a.
* b\_type (`DType`): The DType for b.
* c\_type (`DType`): The DType for c.

**Args:**

* src (`SIMD[c_type, width]`): An int32 SIMD vector.
* a (`SIMD[a_type, width]`): A uint8 SIMD vector.
* b (`SIMD[b_type, width]`): An int8 SIMD vector.

**Returns:**

A SIMD vector of width elements.

---

## dot_i8_to_i32_x86

`dot_i8_to_i32_x86[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]`

The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2.

**Constraints:**

Requires AVX512\_VNNI or AVX2. The size of the output vector must be 4, 8 or 16.

The a argument has range \[0,255]. The b argument has range \[-128,127].

**Parameters:**

* width (`Int`): Size of the output SIMD vector.
* a\_type (`DType`): The DType for a.
* b\_type (`DType`): The DType for b.
* c\_type (`DType`): The DType for c.

**Args:**

* src (`SIMD[c_type, width]`): An int32 SIMD vector.
* a (`SIMD[a_type, width]`): A uint8 SIMD vector.
* b (`SIMD[b_type, width]`): An int8 SIMD vector.

**Returns:**

A SIMD vector of width elements.

---

## downcast

`downcast(layout: Layout, factor: Int) -> Layout`

Splits elements in a layout to create a finer layout without changing the total size, so that alignment is preserved. This function is useful for converting between different data type granularities, such as from uint128 to bf16.

**Args:**

* layout (`Layout`): The layout to downcast.
* factor (`Int`): The number of elements to split into.
**Returns:**

A new layout with adjusted shape and stride for the finer granularity.

---

## driver

Exposes APIs for interacting with hardware, such as allocating tensors on a GPU and moving tensors between the CPU and GPU. It provides interfaces for memory management, device properties, and hardware monitoring. Through these APIs, you can control data placement, track resource utilization, and configure device settings for optimal performance.

For example, you can use the following code to use an accelerator if one is available, otherwise use the CPU:

```python
from max import driver

device = driver.CPU() if driver.accelerator_count() == 0 else driver.Accelerator()
print(f"Using {device} device")
```

## `Accelerator` {#max.driver.Accelerator}

> *class* max.driver.Accelerator(self, id: [int](https://docs.python.org/3/library/functions.html#int) = -1)

Creates an accelerator device with the specified ID. Provides access to GPU or other hardware accelerators in the system.

```python
from max import driver

device = driver.Accelerator()

# Or specify GPU id
device = driver.Accelerator(id=0)  # First GPU
device = driver.Accelerator(id=1)  # Second GPU

# Get device id
device_id = device.id
```

**Parameters:**

**id** ([`int`](https://docs.python.org/3/library/functions.html#int) `,` `optional` ) – The device ID to use. Defaults to -1, which selects the first available accelerator.

**Returns:**

A new Accelerator device object.

**Return type:**

[Accelerator](#max.driver.Accelerator)

## `CPU` {#max.driver.CPU}

> *class* max.driver.CPU(self, id: [int](https://docs.python.org/3/library/functions.html#int) = -1)

Creates a CPU device.

```python
from max import driver

# Create default CPU device
device = driver.CPU()

# Device id is always 0 for CPU devices
device_id = device.id
```

**Parameters:**

**id** ([`int`](https://docs.python.org/3/library/functions.html#int) `,` `optional` ) – The device ID to use. Defaults to -1.

**Returns:**

A new CPU device object.

**Return type:**

[CPU](#max.driver.CPU)

## `DLPackArray` {#max.driver.DLPackArray}

> *class* max.driver.DLPackArray(\*args, \*\*kwargs)

## `Device` {#max.driver.Device}

> *class* max.driver.Device

### `api` {#max.driver.Device.api}

> *property* api

Returns the API used to program the device.

Possible values are:

* `cpu` for host devices.
* `cuda` for NVIDIA GPUs.
* `hip` for AMD GPUs.

```python
from max import driver

device = driver.CPU()
device.api
```

### `can_access` {#max.driver.Device.can_access}

> can\_access

Checks if this device can directly access memory of another device.

```python
from max import driver

gpu0 = driver.Accelerator(id=0)
gpu1 = driver.Accelerator(id=1)
if gpu0.can_access(gpu1):
    print("GPU0 can directly access GPU1 memory.")
```

**Parameters:**

**other** ([`Device`](#max.driver.Device) ) – The other device to check peer access against.

**Returns:**

True if peer access is possible, False otherwise.

**Return type:**

[bool](https://docs.python.org/3/library/functions.html#bool)

### `cpu` {#max.driver.Device.cpu}

> cpu

### `default_stream` {#max.driver.Device.default_stream}

> *property* default\_stream

Returns the default stream for this device. The default stream is initialized when the device object is created.

**Returns:**

The default execution stream for this device.

**Return type:**

DeviceStream

### `id` {#max.driver.Device.id}

> *property* id

Returns a zero-based device id. For a CPU device this is always 0. For GPU accelerators this is the id of the device relative to this host.
Along with the `label`, an id can uniquely identify a device, e.g. `gpu:0`, `gpu:1`.

```python
from max import driver

device = driver.Accelerator()
device_id = device.id
```

**Returns:**

The device ID.

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `is_compatible` {#max.driver.Device.is_compatible}

> *property* is\_compatible

Returns whether this device is compatible with MAX.

**Returns:**

True if the device is compatible with MAX, False otherwise.

**Return type:**

[bool](https://docs.python.org/3/library/functions.html#bool)

### `is_host` {#max.driver.Device.is_host}

> *property* is\_host

Whether this device is the CPU (host) device.

```python
from max import driver

device = driver.CPU()
device.is_host
```

### `label` {#max.driver.Device.label}

> *property* label

Returns device label.

Possible values are:

* `cpu` for host devices.
* `gpu` for accelerators.

```python
from max import driver

device = driver.CPU()
device.label
```

### `stats` {#max.driver.Device.stats}

> *property* stats

Returns utilization data for the device.

```python
from max import driver

device = driver.CPU()
stats = device.stats
```

**Returns:**

A dictionary containing device utilization statistics.

**Return type:**

[dict](https://docs.python.org/3/library/stdtypes.html#dict)

### `synchronize` {#max.driver.Device.synchronize}

> synchronize

Ensures all operations on this device complete before returning.

**Raises:**

[**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If any enqueued operations had an internal error.

## `DeviceSpec` {#max.driver.DeviceSpec}

> *class* max.driver.DeviceSpec(id, device\_type='cpu')

Specification for a device, containing its ID and type. This class provides a way to specify device parameters like ID and type (CPU/GPU) for creating Device instances.

**Parameters:**

* **id** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **device\_type** ([`Literal`](https://docs.python.org/3/library/typing.html#typing.Literal) `[` `'cpu'` `,` `'gpu'` `]` )

### `accelerator()` {#max.driver.DeviceSpec.accelerator}

> *static* accelerator(id=0)

Creates an accelerator (GPU) device specification.

**Parameters:**

**id** ([`int`](https://docs.python.org/3/library/functions.html#int) )

### `cpu()` {#max.driver.DeviceSpec.cpu}

> *static* cpu(id=-1)

Creates a CPU device specification.

**Parameters:**

**id** ([`int`](https://docs.python.org/3/library/functions.html#int) )

### `device_type` {#max.driver.DeviceSpec.device_type}

> device\_type\*: [Literal](https://docs.python.org/3/library/typing.html#typing.Literal)\['cpu', 'gpu']\* *= 'cpu'*

Type of specified device.

### `id` {#max.driver.DeviceSpec.id}

> id\*: [int](https://docs.python.org/3/library/functions.html#int)\*

Provided id for this device.
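For example, a small sketch of describing devices declaratively and only instantiating them later with the `load_devices()` function documented later in this module:

```python
from max import driver

# Describe devices without initializing any hardware yet.
specs = [driver.DeviceSpec.cpu(), driver.DeviceSpec.accelerator(id=0)]

# Turn the specs into live Device objects.
devices = driver.load_devices(specs)
for device in devices:
    print(device.label, device.id)
```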
## `Tensor` {#max.driver.Tensor}

> *class* max.driver.Tensor(self, dtype: [max.\_core.dtype.DType](dtype.md#max.dtype.DType), shape: [collections.abc.Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[int](https://docs.python.org/3/library/functions.html#int)], device: [max.\_core.driver.Device](#max.driver.Device) | [None](https://docs.python.org/3/library/constants.html#None) = None, pinned: [bool](https://docs.python.org/3/library/functions.html#bool) = False)

> *class* max.driver.Tensor(self, dtype: [max.\_core.dtype.DType](dtype.md#max.dtype.DType), shape: [collections.abc.Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[int](https://docs.python.org/3/library/functions.html#int)], stream: max.\_core.driver.DeviceStream, pinned: [bool](https://docs.python.org/3/library/functions.html#bool) = False)

> *class* max.driver.Tensor(self, shape: ndarray\[writable=False], device: max.\_core.driver.Device)

> *class* max.driver.Tensor(self, other: [max.\_core.driver.Tensor](#max.driver.Tensor))

Device-resident tensor representation. Allocates memory onto a given device with the provided shape and dtype. Tensors can be sliced to provide strided views of the underlying memory, but any tensors input into model execution must be contiguous. Supports numpy-style slicing but does not currently support setting items across multiple indices.

```python
from max import driver
from max.dtype import DType

# Create a tensor on CPU
cpu_tensor = driver.Tensor(shape=[2, 3], dtype=DType.float32)

# Create a tensor on GPU
gpu = driver.Accelerator()
gpu_tensor = driver.Tensor(shape=[2, 3], dtype=DType.float32, device=gpu)
```

**Parameters:**

* **dtype** ([`DType`](dtype.md#max.dtype.DType) ) – Data type of tensor elements.
* **shape** (`Sequence` `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – Tuple of positive, non-zero integers denoting the tensor shape.
* **device** ([`Device`](#max.driver.Device) `,` `optional` ) – Device to allocate tensor onto. Defaults to the CPU.
* **pinned** ([`bool`](https://docs.python.org/3/library/functions.html#bool) `,` `optional` ) – If True, memory is page-locked (pinned). Defaults to False.
* **stream** (`DeviceStream` `,` `optional` ) – Stream to associate the tensor with.

Overloaded function.

1. `__init__(self, dtype: max._core.dtype.DType, shape: collections.abc.Sequence[int], device: max._core.driver.Device | None = None, pinned: bool = False) -> None`
2. `__init__(self, dtype: max._core.dtype.DType, shape: collections.abc.Sequence[int], stream: max._core.driver.DeviceStream, pinned: bool = False) -> None`
3. `__init__(self, shape: ndarray[writable=False], device: max._core.driver.Device) -> None`
4. `__init__(self, other: max._core.driver.Tensor) -> None`

> Moves the internals from an existing Tensor object into a new Tensor object.
> Primarily used for initializing subclasses with existing Tensors.

### `contiguous()` {#max.driver.Tensor.contiguous}

> contiguous()

Creates a contiguous copy of the parent tensor.

**Return type:**

[*Tensor*](#max.driver.Tensor)

### `copy` {#max.driver.Tensor.copy}

> copy

Overloaded function.

1. `copy(self, stream: max._core.driver.DeviceStream) -> max._core.driver.Tensor`

> Creates a deep copy on the device associated with the stream.
>
> Args:
> : stream (DeviceStream): The stream to associate the new tensor with.
>
> Returns:
> : Tensor: A new tensor that is a copy of this tensor.
2. `copy(self, device: max._core.driver.Device | None = None) -> max._core.driver.Tensor`

> Creates a deep copy on an optionally given device. If device is None (default), a copy is created on the same device.
>
> ```python
> from max import driver
> from max.dtype import DType
>
> cpu_tensor = driver.Tensor(shape=[2, 3], dtype=DType.bfloat16, device=driver.CPU())
> cpu_copy = cpu_tensor.copy()
>
> # Copy to GPU
> gpu = driver.Accelerator()
> gpu_copy = cpu_tensor.copy(device=gpu)
> ```
>
> Args:
> : device (Device, optional): The device to create the copy on. Defaults to None (same device).
>
> Returns:
> : Tensor: A new tensor that is a copy of this tensor.

### `device` {#max.driver.Tensor.device}

> *property* device

Device on which tensor is resident.

### `dtype` {#max.driver.Tensor.dtype}

> *property* dtype

DType of constituent elements in tensor.

### `element_size` {#max.driver.Tensor.element_size}

> *property* element\_size

Return the size of the element type in bytes.

### `from_dlpack()` {#max.driver.Tensor.from_dlpack}

> from\_dlpack(\*, copy=None)

Create a tensor from an object implementing the dlpack protocol. This usually does not result in a copy, and the producer of the object retains ownership of the underlying memory.

**Parameters:**

* **array** ([`Any`](https://docs.python.org/3/library/typing.html#typing.Any) )
* **copy** ([`bool`](https://docs.python.org/3/library/functions.html#bool) `|` `None` )

**Return type:**

[*Tensor*](#max.driver.Tensor)

### `from_numpy()` {#max.driver.Tensor.from_numpy}

> from\_numpy()

Creates a tensor from a provided numpy array on the host device. The underlying data is not copied unless the array is noncontiguous. If it is, a contiguous copy will be returned.

**Parameters:**

**arr** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Return type:**

[*Tensor*](#max.driver.Tensor)

### `inplace_copy_from()` {#max.driver.Tensor.inplace_copy_from}

> inplace\_copy\_from(src)

Copy the contents of another tensor into this one. These tensors may be on different devices. Requires that both tensors are contiguous and have the same size.

**Parameters:**

**src** ([`Tensor`](#max.driver.Tensor) )

**Return type:**

None

### `is_contiguous` {#max.driver.Tensor.is_contiguous}

> *property* is\_contiguous

Whether or not tensor is contiguously allocated in memory. Returns false if the tensor is a non-contiguous slice.

Currently, we consider certain situations that are contiguous as non-contiguous for the purposes of our engine, such as when a tensor has negative steps.

### `is_host` {#max.driver.Tensor.is_host}

> *property* is\_host

Whether or not tensor is host-resident. Returns false for GPU tensors, true for CPU tensors.

```python
from max import driver
from max.dtype import DType

cpu_tensor = driver.Tensor(shape=[2, 3], dtype=DType.bfloat16, device=driver.CPU())
print(cpu_tensor.is_host)
```

### `item` {#max.driver.Tensor.item}

> item

Returns the scalar value at a given location. Currently implemented only for zero-rank tensors. The return type is converted to a Python built-in type.
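For example, a small sketch of reading a scalar back from a zero-rank, host-resident tensor (assuming a zero-rank NumPy array as the source):

```python
import numpy as np
from max import driver

# Wrap a zero-rank NumPy array as a host tensor.
scalar_tensor = driver.Tensor.from_numpy(np.array(3.14, dtype=np.float32))
print(scalar_tensor.rank)    # => 0
print(scalar_tensor.item())  # => 3.14... as a Python float
```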
### `mmap()` {#max.driver.Tensor.mmap}

> mmap(dtype, shape, mode='copyonwrite', offset=0)

**Parameters:**

* **filename** (`PathLike` `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) )
* **dtype** ([`DType`](dtype.md#max.dtype.DType) )
* **shape** (`ShapeType` `|` [`int`](https://docs.python.org/3/library/functions.html#int) )
* **mode** (`np._MemMapModeKind` )

### `num_elements` {#max.driver.Tensor.num_elements}

> *property* num\_elements

Returns the number of elements in this tensor. Rank-0 tensors have 1 element by convention.

### `pinned` {#max.driver.Tensor.pinned}

> *property* pinned

Whether or not the underlying memory is pinned (page-locked).

### `rank` {#max.driver.Tensor.rank}

> *property* rank

Tensor rank.

### `scalar` {#max.driver.Tensor.scalar}

> scalar

### `shape` {#max.driver.Tensor.shape}

> *property* shape

Shape of tensor.

### `stream` {#max.driver.Tensor.stream}

> *property* stream

Stream to which tensor is bound.

### `to` {#max.driver.Tensor.to}

> to

Overloaded function.

1. `to(self, device: max._core.driver.Device) -> Tensor`

> Return a tensor that’s guaranteed to be on the given device. The tensor is only copied if the requested device is different from the device upon which the tensor is already resident.

2. `to(self, device: max._core.driver.DeviceStream) -> Tensor`

> Return a tensor that’s guaranteed to be on the given device and associated with the given stream. The tensor is only copied if the requested device is different from the device upon which the tensor is already resident.

### `to_numpy()` {#max.driver.Tensor.to_numpy}

> to\_numpy()

Converts the tensor to a numpy array. If the tensor is not on the host, an exception is raised.

**Return type:**

[*ndarray*](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)

### `view()` {#max.driver.Tensor.view}

> view(dtype, shape=None)

Return a new tensor with the given type and shape that shares the underlying memory. If the shape is not given, it will be deduced if possible, or a ValueError is raised.

**Parameters:**

* **dtype** ([`DType`](dtype.md#max.dtype.DType) )
* **shape** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` `|` `None` )

**Return type:**

[*Tensor*](#max.driver.Tensor)

### `zeros` {#max.driver.Tensor.zeros}

> zeros

## `accelerator_api()` {#max.driver.accelerator_api}

> max.driver.accelerator\_api()

Returns the API used to program the accelerator.

**Return type:**

[str](https://docs.python.org/3/library/stdtypes.html#str)

## `devices_exist()` {#max.driver.devices_exist}

> max.driver.devices\_exist(devices)

Identify if devices exist.

**Parameters:**

**devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`DeviceSpec`](#max.driver.DeviceSpec) `]` )

**Return type:**

[bool](https://docs.python.org/3/library/functions.html#bool)

## `load_devices()` {#max.driver.load_devices}

> max.driver.load\_devices(device\_specs)

Initialize and return a list of devices, given a list of device specs.

**Parameters:**

**device\_specs** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`DeviceSpec`](#max.driver.DeviceSpec) `]` )

**Return type:**

[list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Device*](#max.driver.Device)]

## `scan_available_devices()` {#max.driver.scan_available_devices}

> max.driver.scan\_available\_devices()

Returns all accelerators if available, otherwise returns the CPU.
**Return type:**

[list](https://docs.python.org/3/library/stdtypes.html#list)\[[*DeviceSpec*](#max.driver.DeviceSpec)]

---

## DriverVersion

`struct DriverVersion`

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit`

`__init__(out self, value: List[String])`

### `major`

`major(self) -> Int`

### `minor`

`minor(self) -> Int`

### `patch`

`patch(self) -> Int`

### `__str__`

`__str__(self) -> String`

---

## dtype

Provides data type definitions for tensors in MAX Engine. These data types are essential for defining the precision and memory layout of tensor data when working with machine learning models.

This module defines the [`DType`](#max.dtype.DType) enum, which represents all supported tensor data types in MAX Engine, including:

* Integer types (signed and unsigned): `int8` | `uint8` | `int16` | `uint16` | `int32` | `uint32` | `int64` | `uint64`
* Floating-point types: `float8` variants | `float16` | `bfloat16` | `float32` | `float64`
* Boolean type

The module also provides utilities for converting between MAX Engine data types and [NumPy dtypes](https://numpy.org/doc/stable/user/basics.types.html), making it easy to interoperate with the NumPy ecosystem.

```python
import numpy as np
from max.dtype import DType

tensor = np.zeros((2, 3), dtype=DType.float32.to_numpy())

# Convert NumPy dtype to MAX DType
array = np.ones((4, 4), dtype=np.float16)
max_dtype = DType.from_numpy(array.dtype)

# Check properties of data types
is_float = DType.float32.is_float()  # True
is_int = DType.int64.is_integral()  # True
size = DType.float64.size_in_bytes  # 8
```

## `DType` {#max.dtype.DType}

> *class* max.dtype.DType(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)

The tensor data type.

### `align` {#max.dtype.DType.align}

> *property* align

Returns the alignment of the dtype.

### `bfloat16` {#max.dtype.DType.bfloat16}

> bfloat16 *= 71*

### `bool` {#max.dtype.DType.bool}

> bool *= 1*

### `float16` {#max.dtype.DType.float16}

> float16 *= 70*

### `float32` {#max.dtype.DType.float32}

> float32 *= 72*

### `float64` {#max.dtype.DType.float64}

> float64 *= 73*

### `float8_e4m3fn` {#max.dtype.DType.float8_e4m3fn}

> float8\_e4m3fn *= 66*

### `float8_e4m3fnuz` {#max.dtype.DType.float8_e4m3fnuz}

> float8\_e4m3fnuz *= 67*

### `float8_e5m2` {#max.dtype.DType.float8_e5m2}

> float8\_e5m2 *= 68*

### `float8_e5m2fnuz` {#max.dtype.DType.float8_e5m2fnuz}

> float8\_e5m2fnuz *= 69*

### `from_numpy()` {#max.dtype.DType.from_numpy}

> from\_numpy()

Converts a NumPy dtype to the corresponding DType.

**Parameters:**

**dtype** (`np.dtype` ) – The NumPy dtype to convert.

**Returns:**

The corresponding DType enum value.

**Return type:**

[DType](#max.dtype.DType)

**Raises:**

[**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the input dtype is not supported.

### `from_torch()` {#max.dtype.DType.from_torch}

> from\_torch()

**Parameters:**

**dtype** (`dtype` )

**Return type:**

[*DType*](#max.dtype.DType)

### `int16` {#max.dtype.DType.int16}

> int16 *= 137*

### `int32` {#max.dtype.DType.int32}

> int32 *= 139*

### `int64` {#max.dtype.DType.int64}

> int64 *= 141*

### `int8` {#max.dtype.DType.int8}

> int8 *= 135*

### `is_float` {#max.dtype.DType.is_float}

> is\_float

Returns true if the dtype is floating point.

### `is_float8` {#max.dtype.DType.is_float8}

> is\_float8

Returns true if the dtype is any variant of float8.
### `is_half` {#max.dtype.DType.is_half}

> is\_half

Returns true if the dtype is half-precision floating point.

### `is_integral` {#max.dtype.DType.is_integral}

> is\_integral

Returns true if the dtype is an integer.

### `size_in_bytes` {#max.dtype.DType.size_in_bytes}

> *property* size\_in\_bytes

Returns the size of the dtype in bytes.

### `to_numpy()` {#max.dtype.DType.to_numpy}

> to\_numpy()

Converts this `DType` to the corresponding NumPy dtype.

**Returns:**

The corresponding NumPy dtype object.

**Return type:**

[DType](#max.dtype.DType)

**Raises:**

[**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the dtype is not supported.

### `to_torch()` {#max.dtype.DType.to_torch}

> to\_torch()

**Parameters:**

**dtype** ([`DType`](#max.dtype.DType) )

**Return type:**

*dtype*

### `uint16` {#max.dtype.DType.uint16}

> uint16 *= 136*

### `uint32` {#max.dtype.DType.uint32}

> uint32 *= 138*

### `uint64` {#max.dtype.DType.uint64}

> uint64 *= 140*

### `uint8` {#max.dtype.DType.uint8}

> uint8 *= 134*

---

## dtype

Implements the DType class.

These are Mojo built-ins, so you don't need to import them.

## Structs

* [`DType`](/mojo/stdlib/builtin/dtype/DType): Represents DType and provides methods for working with it.

---

## DType

`@register_passable(trivial)`

`struct DType`

Represents DType and provides methods for working with it.

## Fields

* value (`dtype`): The underlying storage for the DType value.

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Hashable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher`

## Aliases

### `bfloat16`

`alias bfloat16`

Represents a brain floating point value whose bitwidth is 16.

### `bool`

`alias bool`

Represents a boolean data type.

### `float16`

`alias float16`

Represents an IEEE754-2008 `binary16` floating point value.

### `float32`

`alias float32`

Represents an IEEE754-2008 `binary32` floating point value.

### `float64`

`alias float64`

Represents an IEEE754-2008 `binary64` floating point value.

### `float8_e3m4`

`alias float8_e3m4`

Represents an 8-bit e3m4 floating point format, encoded as `seeemmmm`:

- (s)ign: 1 bit
- (e)xponent: 3 bits
- (m)antissa: 4 bits
- exponent bias: 3
- nan: 00111111, 11111111
- -0: 10000000
- fn: finite (no inf or -inf encodings)

### `float8_e4m3fn`

`alias float8_e4m3fn`

Represents the E4M3 floating point format defined in the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1).

This type is named differently across libraries and vendors, for example:

* Mojo, PyTorch, JAX, and LLVM refer to it as `e4m3fn`.
* OCP, NVIDIA CUDA, and AMD ROCm refer to it as `e4m3`.
In these contexts, they are all referring to the same finite type specified in the OFP8 standard above, encoded as `seeeemmm`:

* (s)ign: 1 bit
* (e)xponent: 4 bits
* (m)antissa: 3 bits
* exponent bias: 7
* nan: 01111111, 11111111
* -0: 10000000
* fn: finite (no inf or -inf encodings)

### `float8_e4m3fnuz`

`alias float8_e4m3fnuz`

Represents an 8-bit e4m3fnuz floating point format, encoded as `seeeemmm`:

- (s)ign: 1 bit
- (e)xponent: 4 bits
- (m)antissa: 3 bits
- exponent bias: 8
- nan: 10000000
- fn: finite (no inf or -inf encodings)
- uz: unsigned zero (no -0 encoding)

### `float8_e5m2`

`alias float8_e5m2`

Represents the 8-bit E5M2 floating point format from the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1), encoded as `seeeeemm`:

- (s)ign: 1 bit
- (e)xponent: 5 bits
- (m)antissa: 2 bits
- exponent bias: 15
- nan: {0,1}11111{01,10,11}
- inf: 01111100
- -inf: 11111100
- -0: 10000000

### `float8_e5m2fnuz`

`alias float8_e5m2fnuz`

Represents an 8-bit floating point format, encoded as `seeeeemm`:

- (s)ign: 1 bit
- (e)xponent: 5 bits
- (m)antissa: 2 bits
- exponent bias: 16
- nan: 10000000
- fn: finite (no inf or -inf encodings)
- uz: unsigned zero (no -0 encoding)

### `index`

`alias index`

Represents an integral type whose bitwidth is the maximum integral value on the system.

### `int128`

`alias int128 = si128`

Represents a signed integer type whose bitwidth is 128.

### `int16`

`alias int16`

Represents a signed integer type whose bitwidth is 16.

### `int256`

`alias int256 = si256`

Represents a signed integer type whose bitwidth is 256.

### `int32`

`alias int32`

Represents a signed integer type whose bitwidth is 32.

### `int64`

`alias int64`

Represents a signed integer type whose bitwidth is 64.

### `int8`

`alias int8`

Represents a signed integer type whose bitwidth is 8.

### `invalid`

`alias invalid`

Represents an invalid or unknown data type.

### `tensor_float32`

`alias tensor_float32`

Represents a special floating point format supported by NVIDIA Tensor Cores, with the same range as float32 and reduced precision (>=10 bits). Note that this dtype is only available on NVIDIA GPUs.

### `type`

`alias type = dtype`

### `uint128`

`alias uint128 = ui128`

Represents an unsigned integer type whose bitwidth is 128.

### `uint16`

`alias uint16`

Represents an unsigned integer type whose bitwidth is 16.

### `uint256`

`alias uint256 = ui256`

Represents an unsigned integer type whose bitwidth is 256.

### `uint32`

`alias uint32`

Represents an unsigned integer type whose bitwidth is 32.

### `uint64`

`alias uint64`

Represents an unsigned integer type whose bitwidth is 64.

### `uint8`

`alias uint8`

Represents an unsigned integer type whose bitwidth is 8.

## Methods

### `__init__`

`@implicit`

`__init__(value: dtype) -> Self`

Construct a DType from an MLIR dtype.

**Args:**

* value (`dtype`): The MLIR dtype.

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

Compares one DType to another for equality.

**Args:**

* rhs (`Self`): The DType to compare against.

**Returns:**

True if the DTypes are the same and False otherwise.

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

Compares one DType to another for inequality.

**Args:**

* rhs (`Self`): The DType to compare against.

**Returns:**

False if the DTypes are the same and True otherwise.

### `__is__`

`__is__(self, rhs: Self) -> Bool`

Compares one DType to another for equality.

**Args:**

* rhs (`Self`): The DType to compare against.
**Returns:**

True if the DTypes are the same and False otherwise.

### `__isnot__`

`__isnot__(self, rhs: Self) -> Bool`

Compares one DType to another for inequality.

**Args:**

* rhs (`Self`): The DType to compare against.

**Returns:**

True if the DTypes are different and False otherwise.

### `copy`

`copy(self) -> Self`

Copy this DType.

**Returns:**

A copy of the value.

### `__str__`

`__str__(self) -> String`

Gets the name of the DType.

**Returns:**

The name of the dtype.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this dtype to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.

**Args:**

* writer (`W`): The object to write to.

### `__repr__`

`__repr__(self) -> String`

Gets the representation of the DType e.g. `"DType.float32"`.

**Returns:**

The representation of the dtype.

### `get_value`

`get_value(self) -> dtype`

Gets the associated internal kgen.dtype value.

**Returns:**

The kgen.dtype value.

### `__hash__`

`__hash__(self) -> UInt`

Return a 64-bit hash for this `DType` value.

**Returns:**

A 64-bit integer hash of this `DType` value.

`__hash__[H: _Hasher](self, mut hasher: H)`

Updates hasher with this `DType` value.

**Parameters:**

* H (`_Hasher`): The hasher type.

**Args:**

* hasher (`H`): The hasher instance.

### `is_unsigned`

`is_unsigned(self) -> Bool`

Returns True if the type parameter is unsigned and False otherwise.

**Returns:**

Returns True if the input type parameter is unsigned.

### `is_signed`

`is_signed(self) -> Bool`

Returns True if the type parameter is signed and False otherwise.

**Returns:**

Returns True if the input type parameter is signed.

### `is_integral`

`is_integral(self) -> Bool`

Returns True if the type parameter is an integer and False otherwise.

**Returns:**

Returns True if the input type parameter is an integer.

### `is_floating_point`

`is_floating_point(self) -> Bool`

Returns True if the type parameter is a floating-point and False otherwise.

**Returns:**

Returns True if the input type parameter is a floating-point.

### `is_float8`

`is_float8(self) -> Bool`

Returns True if the dtype is an 8-bit precision floating point type, e.g. float8\_e5m2, float8\_e5m2fnuz, float8\_e4m3fn and float8\_e4m3fnuz.

**Returns:**

True if the dtype is an 8-bit precision float, false otherwise.

### `is_half_float`

`is_half_float(self) -> Bool`

Returns True if the dtype is a half-precision floating point type, e.g. either fp16 or bf16.

**Returns:**

True if the dtype is a half-precision float, false otherwise.

### `is_numeric`

`is_numeric(self) -> Bool`

Returns True if the type parameter is numeric (i.e., one you can perform arithmetic operations on).

**Returns:**

Returns True if the input type parameter is either integral or floating-point.

### `sizeof`

`sizeof(self) -> Int`

Returns the size in bytes of the current DType.

**Returns:**

Returns the size in bytes of the current DType.

### `bitwidth`

`bitwidth(self) -> Int`

Returns the size in bits of the current DType.

**Returns:**

Returns the size in bits of the current DType.
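For example, a quick sketch of querying a `DType` value at run time using the methods above:

```mojo
def main():
    var dt = DType.float32
    print(dt)                      # => float32
    print(dt.is_floating_point())  # => True
    print(dt.sizeof())             # => 4
    print(dt.bitwidth())           # => 32
```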
### `dispatch_integral`

`dispatch_integral[: origin.set, //, func: fn[DType]() capturing -> None](self)`

Dispatches an integral function corresponding to the current DType.

**Constraints:**

DType must be integral.

**Parameters:**

* func (`fn[DType]() capturing -> None`): A function, parametrized on dtype, to dispatch.

### `dispatch_floating`

`dispatch_floating[: origin.set, //, func: fn[DType]() capturing -> None](self)`

Dispatches a floating-point function corresponding to the current DType.

**Constraints:**

DType must be floating-point or integral.

**Parameters:**

* func (`fn[DType]() capturing -> None`): A function, parametrized on dtype, to dispatch.

### `dispatch_arithmetic`

`dispatch_arithmetic[: origin.set, //, func: fn[DType]() capturing -> None](self)`

Dispatches a function corresponding to the current DType.

**Parameters:**

* func (`fn[DType]() capturing -> None`): A function, parametrized on dtype, to dispatch.

### `__mlir_type`

`__mlir_type(self) -> !kgen.deferred`

Returns the MLIR type of the current DType as an MLIR type.

**Returns:**

The MLIR type of the current DType.

---

## dual_gemm

`dual_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], ctx: DeviceContext)`

---

## dual_gemm

## Aliases

### `binary_fn_type`

`alias binary_fn_type = fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1]`

## Functions

* [`config_in_smem`](./config_in_smem):
* [`dual_gemm`](./dual_gemm):
* [`dual_gemv`](./dual_gemv):
* [`dual_gemv_kernel`](./dual_gemv_kernel):
* [`multistage_dual_gemm`](./multistage_dual_gemm):
* [`multistage_dual_gemm_kernel`](./multistage_dual_gemm_kernel):
* [`multistage_dual_mma`](./multistage_dual_mma):
* [`swilu`](./swilu):
* [`swishGLU`](./swishGLU): Reference: GLU Variants Improve Transformer by Noam Shazeer. The implementation follows CUTLASS, using one kernel invocation and writing to the destination once.
---

## dual_gemv

`dual_gemv[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], ctx: DeviceContext)`

---

## dual_gemv_kernel

`dual_gemv_kernel[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, simd_width: UInt, tile_m: UInt, tile_n: UInt, num_threads: UInt, binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape])`

---

## dynamic

`dynamic(d: Int) -> ValueOrUnknown`

Creates a dynamic dimension with a runtime value.

**Args:**

* d (`Int`): Runtime dimension value.

**Returns:**

`ValueOrUnknown` - A dynamic dimension with the given value.

---

## DynamicInt

`@register_passable(trivial)`

`struct DynamicInt`

## Fields

* value (`SIMD[uint32, 1]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Intable`, `Movable`, `OptionallyStaticInt`, `UnknownDestructibility`

## Aliases

### `static_value`

`alias static_value = OptionalReg[Int]({:i1 0, 1})`

## Methods

### `__init__`

`__init__(value: Int) -> Self`

### `__int__`

`__int__(self) -> Int`

### `as_uint32`

`as_uint32(self) -> SIMD[uint32, 1]`

---

## DynamicTensor

`struct DynamicTensor[type: DType, rank: Int]`

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Aliases

### `Type`

`alias Type = ManagedTensorSlice[IOSpec(), static_spec=create_unknown()]`

---

## elect_one_sync

`elect_one_sync() -> Bool`

Elects a single thread within a warp to perform an operation.

Note:

* Only supported on NVIDIA SM90+ GPUs.
* Maps directly to the `elect.sync` instruction in CUDA PTX.
* Useful for having a single thread perform an operation while maintaining warp synchronization.

**Returns:**

True for the elected thread, False for all other threads in the warp.

---

## element

Provides element-based access to memory using layout-driven vectorization.

This module implements efficient memory access patterns for multi-dimensional data using the layout system. It provides abstractions for loading and storing data with specific memory layouts, enabling vectorized operations that respect the underlying memory organization.

Key components:

* `Element`: A wrapper around SIMD types that provides layout-driven vectorized operations
* `MemoryElement`: Represents data in memory organized according to a specific layout

These components enable efficient tensor operations by ensuring memory accesses follow optimal patterns defined by the layout system.

## Structs

* [`Element`](./Element): A wrapper around SIMD types that provides layout-driven vectorized operations.
* [`MemoryElement`](./MemoryElement): Represents data in memory organized according to a specific layout.

---

## Element

`struct Element[dtype: DType, layout: Layout, /, index_type: DType = _get_index_type(layout)]`

A wrapper around SIMD types that provides layout-driven vectorized operations.

The `Element` struct extends SIMD types with layout-aware load and store operations, enabling efficient vectorized access to multi-dimensional data. It maps between logical tensor coordinates and physical memory locations according to the specified layout.

## Parameters

* dtype (`DType`): The data type of the elements.
* layout (`Layout`): The memory layout describing how elements are organized.
* index\_type (`DType`): The integer type of the index pointing to each element.

## Fields

* element\_data (`SIMD[dtype, layout.size()]`): The actual SIMD data stored in this element. This field contains the vectorized data values that can be processed efficiently using SIMD operations.
* runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout information for memory access patterns. This field stores the layout information needed to map between logical tensor coordinates and physical memory locations, supporting both compile-time and runtime-determined access patterns.

## Implemented traits

`AnyType`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `element_data_type`

`alias element_data_type = SIMD[dtype, layout.size()]`

The SIMD type used to store and process the element data. This type alias defines a SIMD vector with the specified data type and size matching the layout's total element count, enabling efficient vectorized operations.

## Methods

### `__init__`

`@implicit`

`__init__(out self, element_data: SIMD[dtype, layout.size()])`

Initializes an Element with the given SIMD data.

**Args:**

* element\_data (`SIMD[dtype, layout.size()]`): The SIMD data to initialize the element with.

`__init__(out self, element_data: SIMD[dtype, layout.size()], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type])`

Initializes an Element with the given SIMD data and runtime layout.

**Args:**

* element\_data (`SIMD[dtype, layout.size()]`): The SIMD data to initialize the element with.
* runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access.

### `load`

`static load(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type] = RuntimeLayout()) -> Self`

Loads data from memory according to the specified layout.

This method loads data from memory using the layout information to determine the memory access pattern. It supports both rank-1 and rank-2 layouts with various stride patterns, optimizing for contiguous memory access when possible.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from.
* runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access.

**Returns:**

A new `Element` containing the loaded data.
### `masked_load`

`static masked_load(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type] = RuntimeLayout()) -> Self`

Loads data from memory with masking for partial loads.

This method loads data from memory using the layout information, but also handles cases where the runtime dimensions are smaller than the static layout dimensions. It ensures that only valid memory locations are accessed.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from.
* runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access.

**Returns:**

A new `Element` containing the loaded data, with zeros in positions beyond the runtime dimensions.

### `store`

`store(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin])`

Stores element data to memory according to the specified layout.

This method performs a layout-aware store operation, writing data to memory following the access patterns defined by the layout. It optimizes memory writes based on the layout's stride patterns to maximize performance.

The method handles different memory layout patterns:

* For rank-1 tensors with contiguous memory (stride=1), it uses vectorized stores
* For rank-2 tensors with contiguous rows or columns, it uses optimized slice-based stores
* For non-contiguous memory layouts, it performs element-by-element stores

Unlike `masked_store()`, this method assumes the full static dimensions will be written and does not perform runtime dimension boundary checking.

Note: This method is constrained to layouts with rank <= 2.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): Mutable pointer to the memory location where data will be stored.

### `masked_store`

`masked_store(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin])`

Stores element data to memory with masking for partial stores.

This method performs a layout-aware store operation with boundary checking. It ensures that only valid memory locations are written to when the runtime dimensions are smaller than the static layout dimensions, preventing out-of-bounds memory access.

The method optimizes for different memory layouts:

* For contiguous memory (stride=1), it uses vectorized stores when possible
* For non-contiguous memory, it performs element-by-element stores
* For all patterns, it respects runtime dimension bounds

Note: This method is constrained to layouts with rank <= 2.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): Pointer to the memory location where data will be stored.

### `__str__`

`__str__(self) -> String`

Returns a string representation of the element.

**Returns:**

A string representation of the element's data.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes the element to the specified writer.

**Parameters:**

* W (`Writer`): Type parameter representing a Writer implementation.

**Args:**

* writer (`W`): The writer to output the element representation to.
---

## elementwise

`elementwise[: origin.set, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: Int)`

Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed.

**Parameters:**

* func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function.
* simd\_width (`Int`): The SIMD vector width to use.
* use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls.
* target (`StringSlice[StaticConstantOrigin]`): The target to run on.
* \_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace.

**Args:**

* shape (`Int`): The shape of the buffer.

`elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type])`

Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed.

**Parameters:**

* rank (`Int`): The rank of the buffer.
* func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function.
* simd\_width (`Int`): The SIMD vector width to use.
* use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls.
* target (`StringSlice[StaticConstantOrigin]`): The target to run on.
* \_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace.

**Args:**

* shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer.

`elementwise[: origin.set, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: Int, context: DeviceContext)`

Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed.

**Parameters:**

* func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function.
* simd\_width (`Int`): The SIMD vector width to use.
* use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls.
* target (`StringSlice[StaticConstantOrigin]`): The target to run on.
* \_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace.

**Args:**

* shape (`Int`): The shape of the buffer.
* context (`DeviceContext`): The device context to use.

`elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type], context: DeviceContext)`

Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed.

**Parameters:**

* rank (`Int`): The rank of the buffer.
* func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function.
* simd\_width (`Int`): The SIMD vector width to use.
* use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls.
* target (`StringSlice[StaticConstantOrigin]`): The target to run on.
* \_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace.

**Args:**

* shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer.
* context (`DeviceContext`): The device context to use.

`elementwise[: origin.set, rank: Int, //, func: fn[Int, Int](IndexList[$1]) capturing -> None, simd_width: Int, *, use_blocking_impl: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](shape: IndexList[rank, element_type=element_type], context: DeviceContextPtr)`

Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed.

**Parameters:**

* rank (`Int`): The rank of the buffer.
* func (`fn[Int, Int](IndexList[$1]) capturing -> None`): The body function.
* simd\_width (`Int`): The SIMD vector width to use.
* use\_blocking\_impl (`Bool`): Do not invoke the function using asynchronous calls.
* target (`StringSlice[StaticConstantOrigin]`): The target to run on.
* \_trace\_description (`StringSlice[StaticConstantOrigin]`): Description of the trace.

**Args:**

* shape (`IndexList[rank, element_type=element_type]`): The shape of the buffer.
* context (`DeviceContextPtr`): The device context to use.
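For example, a minimal sketch of covering a 1-D shape with a body function (printing instead of doing real work, and assuming the default CPU target):

```mojo
from algorithm import elementwise
from utils.index import IndexList

def main():
    @parameter
    fn body[width: Int, rank: Int](idx: IndexList[rank]):
        # Invoked with a suitable SIMD width for each starting index.
        print("index:", idx[0], "width:", width)

    # Cover 8 elements using a SIMD width of 4.
    elementwise[body, 4](8)
```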
---

## elementwise_epilogue_c_tile

`elementwise_epilogue_c_tile[: origin.set, //, simd_width: Int, type: DType, origin: MutableOrigin, c_shape: DimList, func: fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None](offset: GemmShape, tile_len: GemmShape, c: NDBuffer[type, 2, origin, c_shape])`

---

## elu

`elu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]`

Compute the ELU op using the equation $z$ if $z \ge 0$, else $\alpha (e^z - 1)$.

**Parameters:**

* type (`DType`): DType used for the computation.
* simd\_width (`Int`): SIMD width used for the computation.

**Args:**

* x (`SIMD[type, simd_width]`): The value to compute the ELU operation on.

**Returns:**

The result of the ELU operation.
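A scalar sketch of the same piecewise definition (illustration only; the library version above is parametrized over dtype and SIMD width):

```mojo
from math import exp

fn elu_scalar(z: Float32, alpha: Float32 = 1.0) -> Float32:
    # ELU: identity for non-negative inputs, alpha * (e^z - 1) otherwise.
    return z if z >= 0 else alpha * (exp(z) - 1)
```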
---

## embedding

The `embedding` module provides classes for mapping integer indices (like token IDs) to dense vector representations. These embedding operations are fundamental building blocks for natural language processing, recommendation systems, and other tasks involving discrete tokens.

* `Embedding`: Basic embedding lookup table for simple use cases
* `EmbeddingV2`: Enhanced embedding with device placement control and improved memory management
* `VocabParallelEmbedding`: Distributed embedding that shards the vocabulary across multiple devices for large embedding tables

Here’s an example demonstrating how to use embeddings:

```python
import max.nn as nn
from max.graph import Graph, ops, DeviceRef
from max.dtype import DType
import numpy as np

with Graph(name="embedding_example") as graph:
    # Define dimensions
    batch_size = 4
    seq_length = 16
    vocab_size = 10000
    hidden_dim = 256

    # Create input tensor of token indices
    input_data = np.random.randint(0, vocab_size, (batch_size, seq_length), dtype=np.int32)
    input_indices = ops.constant(input_data, dtype=DType.int32, device=DeviceRef.CPU())

    # Create embedding layer
    embedding = nn.EmbeddingV2(
        vocab_size=vocab_size,
        hidden_dim=hidden_dim,
        dtype=DType.float32,
        device=DeviceRef.GPU(),
        name="token_embeddings"
    )

    # Look up embeddings for input indices
    embeddings = embedding(input_indices)

    print(f"Embedding output shape: {embeddings.shape}")
    # Embedding output shape: [Dim(4), Dim(16), Dim(256)]
```

## `Embedding` {#max.nn.embedding.Embedding}

> *class* max.nn.embedding.Embedding(vocab\_size, hidden\_dim, dtype, device, quantization\_encoding=None, name=None)

A lookup table for embedding integer indices into dense vectors.

This layer maps each integer index to a dense vector of fixed size. Embedding weights are stored on the CPU but are moved to the specified device during the model init phase.

Example:

```python
embedding_layer = Embedding(
    vocab_size=1000,
    hidden_dim=256,
    dtype=DType.float32,
    device=DeviceRef.GPU(),
    name="embeddings",
)

token_indices: TensorValueLike
embeddings = embedding_layer(token_indices)
```

Initializes the embedding layer with the given arguments.

**Parameters:**

* **vocab\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of unique items in the vocabulary. Indices must be in the range `[0, vocab_size)`.
* **hidden\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimensionality of each embedding vector.
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The data type of the embedding weights.
* **device** (`DeviceRef` ) – The device where embedding lookups are executed. Model init transfers the initially CPU-resident weights to this device.
* **name** (`Optional` `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `]` ) – The name identifier for the embedding weight matrix.
* **quantization\_encoding** (`Optional` `[` [`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) `]` )

### `device` {#max.nn.embedding.Embedding.device}

> device\*: DeviceRef\*

The device on which embedding lookup is performed.

### `weight` {#max.nn.embedding.Embedding.weight}

> weight\*: [Weight](../graph/Weight.md#max.graph.Weight)\*

The embedding weight matrix stored on the CPU. Model init moves weights to the device specified in [`device`](#max.nn.embedding.Embedding.device).

## `EmbeddingV1` {#max.nn.embedding.EmbeddingV1}

> *class* max.nn.embedding.EmbeddingV1(weights, device)

A lookup table for embedding integer indices into dense vectors.

Deprecated: Use Embedding instead.
**Parameters:**

* **weights** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **device** (`DeviceRef` )

### `device` {#max.nn.embedding.EmbeddingV1.device}

> device\*: DeviceRef\*

### `weights` {#max.nn.embedding.EmbeddingV1.weights}

> weights\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\*

## `VocabParallelEmbedding` {#max.nn.embedding.VocabParallelEmbedding}

> *class* max.nn.embedding.VocabParallelEmbedding(vocab\_size, hidden\_dim, dtype, devices, quantization\_encoding=None, name=None)

A lookup table for embedding integer indices into dense vectors.

This layer works like nn.Embedding except the embedding table is sharded on the vocabulary dimension across all devices.

Example:

```python
embedding_layer = VocabParallelEmbedding(
    vocab_size=1000,
    hidden_dim=256,
    dtype=DType.float32,
    devices=[DeviceRef.GPU(0), DeviceRef.GPU(1)],
    name="embeddings",
)

# Token indices of shape: [batch, ..., num_indices].
token_indices: TensorValueLike
embeddings = embedding_layer(token_indices)
```

**Parameters:**

* **vocab\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of unique items in the vocabulary. Indices must be in the range `[0, vocab_size)`.
* **hidden\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimensionality of each embedding vector.
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The data type of the embedding weights.
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` ) – The devices where embedding lookups are executed. Model init transfers the initially CPU-resident weights to these devices.
* **name** (`Optional` `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `]` ) – The name identifier for the embedding weight matrix.
* **quantization\_encoding** (`Optional` `[` [`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) `]` )

---

## Embedding

An embedding (also known as a "vector embedding") is a numerical representation of information in a high-dimensional vector space. For example, a token embedding (or word embedding) encodes the meaning of words for use in large language models (LLMs).

Because artificial neural networks (AI models) are a sequence of mathematical operations, they require numerical structures as input. Vector embeddings are numerical structures that provide a way to express a wide range of complex concepts.
They can be used to capture information about all sorts of things, including words, groups of words, sounds, images, and more.

For example, [tokenizing](tokenization.mdx) a word like "bank" into a simple number can't encode the different meanings in "bank loan" and "river bank." By converting the token into a high-dimensional vector, we can encode (or "embed") a variety of word meanings that help the model understand word relationships using a notion of closeness along various vector dimensions (expressed through [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)).

In this way, when a model encounters the embedding for the word "bank," it can recognize the relationship it has with nearby words such as "loan" or "river," based on how close they are to each other along different vector dimensions (perhaps a "finance" dimension vs. a "geography" dimension learned during training).

Although word embeddings are a type of static embedding that encode the meaning of individual words as input to an LLM, an LLM also builds its own embeddings that are hidden inside the model. For example, as an LLM tries to understand the relationship between each word from an input sequence, it compresses more information into each token embedding based on the attention scores computed in the [self-attention layer](self-attention.mdx).

:::note Embedding models
Whereas the token embeddings described above use a vector space to represent the meaning of individual tokens, the output from an embedding model uses a vector space to represent the meaning of the input data (a document) as a whole. In this way, an embedding model allows you to programmatically search and compare different documents by analyzing their corresponding embeddings, which can reveal nuanced meaning and semantics far beyond what a pure text comparison can achieve.
:::

---

## EnableState

`@register_passable(trivial)`

`struct EnableState`

## Fields

* code (`SIMD[int32, 1]`):

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `DISABLED`

`alias DISABLED = EnableState(__init__[__mlir_type.!pop.int_literal](0))`

Feature disabled

### `ENABLED`

`alias ENABLED = EnableState(__init__[__mlir_type.!pop.int_literal](1))`

Feature enabled

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

### `__ne__`

`__ne__(self, other: Self) -> Bool`

---

## engine

The APIs in this module allow you to run inference with MAX Engine—a graph compiler and runtime that accelerates your AI models on a wide variety of hardware.

## `InferenceSession` {#max.engine.InferenceSession}

> *class* max.engine.InferenceSession(num\_threads=None, devices=None, \*, custom\_extensions=None)

Manages an inference session in which you can load and run models.

You need an instance of this to load a model as a [`Model`](#max.engine.Model) object. For example:

```python
session = engine.InferenceSession()
model_path = Path('bert-base-uncased')
model = session.load(model_path)
```

**Parameters:**

* **num\_threads** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – Number of threads to use for the inference session. This defaults to the number of physical cores on your machine.
* **devices** (`Iterable` `[` [`Device`](driver.md#max.driver.Device) `]` `|` `None` ) – A list of devices on which to run inference. Default is the host CPU only.
* **custom\_extensions** (`CustomExtensionsType` `|` `None` ) – The extensions to load for the model.
Supports paths to `.mojopkg` custom ops, `.so` custom op libraries for PyTorch, and `.pt` TorchScript files for torch metadata libraries. Supports `TorchMetadata` and `torch.jit.ScriptModule` objects for torch metadata libraries without serialization.

### `devices` {#max.engine.InferenceSession.devices}

> *property* devices\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Device](driver.md#max.driver.Device)]\*

A list of available devices.

### `gpu_profiling()` {#max.engine.InferenceSession.gpu_profiling}

> gpu\_profiling(mode)

Enables end-to-end GPU profiling configuration.

**Parameters:**

**mode** (`GPUProfilingMode` )

### `load()` {#max.engine.InferenceSession.load}

> load(model, \*, custom\_extensions=None, custom\_ops\_path=None, weights\_registry=None)

Loads a trained model and compiles it for inference.

**Parameters:**

* **model** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) `|` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) ) – Path to a model.
* **custom\_extensions** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) `|` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `]` `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) `|` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `|` `None` ) – The extensions to load for the model. Supports paths to `.mojopkg` custom ops.
* **custom\_ops\_path** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) – The path to your custom ops Mojo package. Deprecated, use `custom_extensions` instead.
* **weights\_registry** ([`Mapping`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Mapping) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`DLPackArray`](driver.md#max.driver.DLPackArray) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `[` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `,` [`dtype`](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype) `[` `\_ScalarType_co` `]` `]` `]` `|` `None` ) – A mapping from model weight names to their values. The values are currently expected to be dlpack arrays. If an array is a read-only numpy array, the user must ensure that its lifetime extends beyond the lifetime of the model.

**Returns:**

The loaded model, compiled and ready to execute.

**Raises:**

[**RuntimeError**](https://docs.python.org/3/library/exceptions.html#RuntimeError) – If the path provided is invalid.

**Return type:**

[*Model*](#max.engine.Model)
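For example, here's a minimal sketch of loading a model together with a custom ops package; the file names are hypothetical placeholders:

```python
from pathlib import Path
from max import engine

session = engine.InferenceSession()

# Hypothetical artifacts; substitute your own model and extension paths.
model = session.load(
    Path("my_model.torchscript"),
    custom_extensions=[Path("my_custom_ops.mojopkg")],
)

# Inspect the compiled model's inputs (see Model.input_metadata below).
for tensor in model.input_metadata:
    print(tensor.name, tensor.shape, tensor.dtype)
```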
### `reset_stats_report()` {#max.engine.InferenceSession.reset_stats_report}

> reset\_stats\_report()

Clears all entries in stats\_report.

**Return type:**

None

### `set_mojo_assert_level()` {#max.engine.InferenceSession.set_mojo_assert_level}

> set\_mojo\_assert\_level(level)

Sets which Mojo asserts are kept in the compiled model.

**Parameters:**

**level** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `AssertLevel` )

### `set_mojo_log_level()` {#max.engine.InferenceSession.set_mojo_log_level}

> set\_mojo\_log\_level(level)

Sets the verbosity of Mojo logging in the compiled model.

**Parameters:**

**level** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `LogLevel` )

### `set_split_k_reduction_precision()` {#max.engine.InferenceSession.set_split_k_reduction_precision}

> set\_split\_k\_reduction\_precision(precision)

Sets the accumulation precision for split k reductions in large matmuls.

**Parameters:**

**precision** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `SplitKReductionPrecision` )

### `stats_report` {#max.engine.InferenceSession.stats_report}

> *property* stats\_report\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Any](https://docs.python.org/3/library/typing.html#typing.Any)]\*

Metadata about model compilation (PyTorch only).

Prints a list of “fallback ops”, which are ops that could not be lowered to our internal dialect MO. Fallback ops have to be executed using the original framework (i.e. PyTorch), which makes the model much slower. This function is a good starting point for debugging model performance.

## `Model` {#max.engine.Model}

> *class* max.engine.Model

A loaded model that you can execute.

Do not instantiate this class directly. Instead, create it with [`InferenceSession`](#max.engine.InferenceSession).

### `__call__()` {#max.engine.Model.__call}

> \_\_call\_\_(\*args, \*\*kwargs)

Call self as a function.

**Parameters:**

* **self** ([`Model`](#max.engine.Model) )
* **args** ([`DLPackArray`](driver.md#max.driver.DLPackArray) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `[` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `,` [`dtype`](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype) `[` `\_ScalarType_co` `]` `]` `|` [`Tensor`](driver.md#max.driver.Tensor) `|` [`MojoValue`](#max.engine.MojoValue) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`bool`](https://docs.python.org/3/library/functions.html#bool) `|` [`generic`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.generic) )
* **kwargs** ([`DLPackArray`](driver.md#max.driver.DLPackArray) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `[` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `,` [`dtype`](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype) `[` `\_ScalarType_co` `]` `]` `|` [`Tensor`](driver.md#max.driver.Tensor) `|` [`MojoValue`](#max.engine.MojoValue) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`bool`](https://docs.python.org/3/library/functions.html#bool) `|` [`generic`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.generic) )

**Return type:**

[list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Tensor*](driver.md#max.driver.Tensor) | [*MojoValue*](#max.engine.MojoValue)]

### `execute()` {#max.engine.Model.execute}

> execute(\*args)

**Parameters:**

* **self** ([`Model`](#max.engine.Model) )
* **args** ([`DLPackArray`](driver.md#max.driver.DLPackArray) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `[` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `,` [`dtype`](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype) `[` `\_ScalarType_co` `]` `]`
`|` [`Tensor`](driver.md#max.driver.Tensor) `|` [`MojoValue`](#max.engine.MojoValue) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`bool`](https://docs.python.org/3/library/functions.html#bool) `|` [`generic`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.generic) ) **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Tensor*](driver.md#max.driver.Tensor) | [*MojoValue*](#max.engine.MojoValue)] ### `execute_legacy()` {#max.engine.Model.execute_legacy} > execute\_legacy(\*\*kwargs) **Parameters:** * **self** ([`Model`](#max.engine.Model) ) * **kwargs** ([`Any`](https://docs.python.org/3/library/typing.html#typing.Any) ) **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [*ndarray*](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) | [dict](https://docs.python.org/3/library/stdtypes.html#dict) | [list](https://docs.python.org/3/library/stdtypes.html#list) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)] ### `input_metadata` {#max.engine.Model.input_metadata} > *property* input\_metadata Metadata about the model’s input tensors, as a list of [`TensorSpec`](#max.engine.TensorSpec) objects. For example, you can print the input tensor names, shapes, and dtypes: ```python for tensor in model.input_metadata: print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}') ``` ### `output_metadata` {#max.engine.Model.output_metadata} > *property* output\_metadata Metadata about the model’s output tensors, as a list of [`TensorSpec`](#max.engine.TensorSpec) objects. For example, you can print the output tensor names, shapes, and dtypes: ```python for tensor in model.output_metadata: print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}') ``` ## `MojoValue` {#max.engine.MojoValue} > *class* max.engine.MojoValue This is work in progress and you should ignore it for now. ## `TensorSpec` {#max.engine.TensorSpec} > *class* max.engine.TensorSpec(self, shape: [collections.abc.Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)] | [None](https://docs.python.org/3/library/constants.html#None), dtype: [max.\_core.dtype.DType](dtype.md#max.dtype.DType), name: [str](https://docs.python.org/3/library/stdtypes.html#str)) Defines the properties of a tensor, including its name, shape and data type. For usage examples, see [`Model.input_metadata`](#max.engine.Model.input_metadata). **Parameters:** * **shape** – The tensor shape. * **dtype** – The tensor data type. * **name** – The tensor name. ### `dtype` {#max.engine.TensorSpec.dtype} > *property* dtype A tensor data type. ### `name` {#max.engine.TensorSpec.name} > *property* name A tensor name. ### `shape` {#max.engine.TensorSpec.shape} > *property* shape The shape of the tensor as a list of integers. If a dimension size is unknown/dynamic (such as the batch size), its value is `None`. --- ## entrypoints ## `LLM` {#max.entrypoints.llm.LLM} > *class* max.entrypoints.llm.LLM(pipeline\_config) A high level interface for interacting with LLMs. 
**Parameters:**

**pipeline\_config** ([`PipelineConfig`](pipelines/config.md#max.pipelines.lib.config.PipelineConfig) )

### `generate()` {#max.entrypoints.llm.LLM.generate}

> generate(prompts, max\_new\_tokens=100, use\_tqdm=True)

Generates text completions for the given prompts.

**Parameters:**

* **prompts** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `]` ) – The input string or list of strings to generate completions for.
* **max\_new\_tokens** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – The maximum number of tokens to generate in the response.
* **use\_tqdm** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether to display a progress bar during generation.

**Returns:**

A list of generated text completions corresponding to each input prompt.

**Raises:**

* [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If prompts is empty or contains invalid data.
* [**RuntimeError**](https://docs.python.org/3/library/exceptions.html#RuntimeError) – If the model fails to generate completions.

**Return type:**

[list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]
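For example, here's a minimal offline-generation sketch; the model path is an illustrative placeholder, and `PipelineConfig` is assumed to be importable from the pipelines package documented elsewhere in this reference:

```python
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig

# Hypothetical model repository; substitute any supported model.
pipeline_config = PipelineConfig(model_path="modularai/Llama-3.1-8B-Instruct-GGUF")
llm = LLM(pipeline_config)

responses = llm.generate(
    ["What is the capital of France?"],
    max_new_tokens=32,
)
print(responses[0])
```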
---

## env

Provides functions for working with environment variables.

You can import these APIs from the `os` package. For example:

```mojo
from os import setenv
```

## Functions

* [`getenv`](/mojo/stdlib/os/env/getenv): Returns the value of the given environment variable.
* [`setenv`](/mojo/stdlib/os/env/setenv): Changes or adds an environment variable.
* [`unsetenv`](/mojo/stdlib/os/env/unsetenv): Unsets an environment variable.

---

## env_get_bool

`env_get_bool[name: StringSlice[StaticConstantOrigin]]() -> Bool`

Try to get a boolean-valued define. Compilation fails if the name is not defined or the value is neither `True` nor `False`.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the define.

**Returns:**

A boolean parameter value.

`env_get_bool[name: StringSlice[StaticConstantOrigin], default: Bool]() -> Bool`

Try to get a bool-valued define. If the name is not defined, return a default value instead. The boolean must be either `True` or `False`.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the define.
* default (`Bool`): The default value to use.

**Returns:**

A bool parameter value.

---

## env_get_dtype

`env_get_dtype[name: StringSlice[StaticConstantOrigin], default: DType]() -> DType`

Try to get a DType-valued define. If the name is not defined, return a default value instead.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the define.
* default (`DType`): The default value to use.

**Returns:**

A DType parameter value.

---

## env_get_int

`env_get_int[name: StringSlice[StaticConstantOrigin]]() -> Int`

Try to get an integer-valued define. Compilation fails if the name is not defined.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the define.

**Returns:**

An integer parameter value.

`env_get_int[name: StringSlice[StaticConstantOrigin], default: Int]() -> Int`

Try to get an integer-valued define. If the name is not defined, return a default value instead.

Example:

```mojo
from sys.param_env import env_get_int

def main():
    alias number = env_get_int[
        "favorite_number",
        1 # Default value
    ]()
    parametrized[number]()

fn parametrized[num: Int]():
    print(num)
```

If the program is `app.mojo`:

* `mojo run -D favorite_number=2 app.mojo` prints `2`
* `mojo run app.mojo` prints the default value `1`

Note: useful for parameterizing SIMD vector sizes.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the define.
* default (`Int`): The default value to use.

**Returns:**

An integer parameter value.

---

## env_get_string

`env_get_string[name: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]`

Try to get a string-valued define. Compilation fails if the name is not defined.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the define.

**Returns:**

A string parameter value.

`env_get_string[name: StringSlice[StaticConstantOrigin], default: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]`

Try to get a string-valued define. If the name is not defined, return a default value instead.

**Parameters:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the define.
* default (`StringSlice[StaticConstantOrigin]`): The default value to use.

**Returns:**

A string parameter value.

---

## equality_comparable

## Traits

* [`EqualityComparable`](/mojo/stdlib/builtin/equality_comparable/EqualityComparable): A type which can be compared for equality with other instances of itself.

---

## EqualityComparable

A type which can be compared for equality with other instances of itself.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__eq__`

`__eq__(self: _Self, other: _Self) -> Bool`

Define whether two instances of the object are equal to each other.

**Args:**

* other (`_Self`): Another instance of the same type.

**Returns:**

True if the instances are equal according to the type's definition of equality, False otherwise.

### `__ne__`

`__ne__(self: _Self, other: _Self) -> Bool`

Define whether two instances of the object are not equal to each other.

**Args:**

* other (`_Self`): Another instance of the same type.

**Returns:**

True if the instances are not equal according to the type's definition of equality, False otherwise.
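For example, here's a minimal sketch of a struct conforming to the trait (the `Point` struct is illustrative):

```mojo
@value
struct Point(EqualityComparable):
    var x: Int
    var y: Int

    fn __eq__(self, other: Self) -> Bool:
        # Two points are equal when both coordinates match.
        return self.x == other.x and self.y == other.y

    fn __ne__(self, other: Self) -> Bool:
        return not self == other

fn main():
    print(Point(1, 2) == Point(1, 2))  # True
    print(Point(1, 2) != Point(3, 4))  # True
```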
---

## erf

`erf[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Performs the elementwise Erf on a SIMD vector.

**Constraints:**

The type must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): SIMD vector to perform elementwise Erf on.

**Returns:**

The result of the elementwise Erf operation.

---

## erfc

`erfc[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the `erfc` of the inputs.

**Constraints:**

The input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): The input argument.

**Returns:**

The `erfc` of the input.

---

## error

Implements the Error class.

These are Mojo built-ins, so you don't need to import them.

## Structs

* [`Error`](/mojo/stdlib/builtin/error/Error): This type represents an Error.

---

## Error

`@register_passable`

`struct Error`

This type represents an Error.

## Fields

* data (`UnsafePointer[SIMD[uint8, 1]]`): A pointer to the beginning of the string data being referenced.
* loaded\_length (`Int`): The length of the string being referenced. Error instances conditionally own their error message. To reduce the size of the error instance, we use the sign bit of the length field to store the ownership value. When loaded\_length is negative, it indicates ownership and a free is executed in the destructor.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__() -> Self`

Default constructor.

`@implicit`

`__init__(value: StringLiteral[value]) -> Self`

Construct an Error object with a given string literal.

**Args:**

* value (`StringLiteral[value]`): The error message.

`@implicit`

`__init__(src: String) -> Self`

Construct an Error object with a given string.

**Args:**

* src (`String`): The error message.

`@implicit`

`__init__(src: StringSlice[origin]) -> Self`

Construct an Error object with a given string ref.

**Args:**

* src (`StringSlice[origin]`): The error message.

`__init__[*Ts: Writable](*args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")) -> Self`

Construct an Error by concatenating a sequence of Writable arguments.

**Parameters:**

* \*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`.

**Args:**

* \*args (`*Ts`): A sequence of Writable arguments.
* sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements.
* end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements.

### `__copyinit__`

`__copyinit__(existing: Self) -> Self`

Creates a deep copy of an existing error.

**Args:**

* existing (`Self`): The error to copy from.

### `__del__`

`__del__(owned self)`

Releases memory if allocated.

### `__bool__`

`__bool__(self) -> Bool`

Returns True if the error is set and false otherwise.

**Returns:**

True if the error object contains a value and False otherwise.

### `copy`

`copy(self) -> Self`

Copy the object.

**Returns:**

A copy of the value.

### `__str__`

`__str__(self) -> String`

Converts the Error to string representation.

**Returns:**

A String of the error message.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this error to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.

**Args:**

* writer (`W`): The object to write to.

### `__repr__`

`__repr__(self) -> String`

Converts the Error to printable representation.

**Returns:**

A printable representation of the error message.

### `byte_length`

`byte_length(self) -> Int`

Get the length of the Error string in bytes.

Notes: This does not include the trailing null terminator in the count.

**Returns:**

The length of the Error string in bytes.

### `unsafe_cstr_ptr`

`unsafe_cstr_ptr(self) -> UnsafePointer[SIMD[int8, 1]]`

Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be NUL terminated, and not null.

**Returns:**

The pointer to the underlying memory.

### `as_string_slice`

`as_string_slice(self) -> StringSlice[ImmutableAnyOrigin]`

Returns a string slice of the data maybe owned by the Error.

Notes: Since the data is not guaranteed to be owned by the Error, the resulting StringSlice is given an ImmutableAnyOrigin.
**Returns:**

A string slice pointing to the data maybe owned by the Error.

---

## Errors, error handling, and context managers

This page discusses how to raise errors in Mojo programs and how to detect and handle error conditions. It also discusses how you can use context managers to allocate and release resources such as files correctly, even when error conditions occur. Finally, it shows you how to implement context managers for your own custom resources.

## Raise an error

The `raise` statement raises an error condition in your program. You provide the `raise` statement with an [`Error`](/mojo/stdlib/builtin/error/Error) instance to indicate the type of error that occurred. For example:

```mojo
raise Error("integer overflow")
```

As a convenience, you can instead provide an error message in the form of a [`String`](/mojo/stdlib/collections/string/string/String) or [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral) value, and `raise` automatically uses that to create an `Error` instance. So you can raise the same error condition as shown above by executing:

```mojo
raise "integer overflow"
```

:::note
Currently, Mojo does not support typed error conditions. All errors are instances of `Error`, and the only thing that distinguishes different error conditions is the error message that you provide.
:::

An error interrupts the current execution flow of your program. If you provide an error handler (as described in [Handle an error](#handle-an-error)) in the current function, execution resumes with that handler. If the error isn't handled in the current function, it propagates to the calling function and so on. If an error isn't caught by any error handler, your program terminates with a non-zero exit code and prints the error message. For example:

```output
Unhandled exception caught during execution: integer overflow
```

If a function you define using the `fn` keyword can raise an error, you must include the `raises` keyword in the function definition. For example:

```mojo
fn incr(n: Int) raises -> Int:
    if n == Int.MAX:
        raise "inc: integer overflow"
    else:
        return n + 1
```

If you don't include the `raises` keyword when defining a function with `fn`, then the function must explicitly handle any errors that might occur in code that it executes. For example:

```mojo
# This function doesn't compile because of the unhandled error
fn unhandled_error(n: Int):
    print(n, "+ 1 =", incr(n))

# This function compiles because it handles the possible error
fn handled_error(n: Int):
    try:
        print(n, "+ 1 =", incr(n))
    except e:
        print("Handled an error:", e)
```

In contrast, you **cannot** use the `raises` keyword when defining a function using the `def` keyword, because `def` always implies that the function might raise an error. So the following is equivalent to the `incr` function defined above with `fn`:

```mojo
def incr(n: Int) -> Int:
    if n == Int.MAX:
        raise "inc: integer overflow"
    else:
        return n + 1
```

## Handle an error

Mojo allows you to detect and handle error conditions using the `try-except` control flow structure, whose full syntax is:

```mojo
try:
    # Code block to execute that might raise an error
except e:
    # Code block to execute if an error occurs
else:
    # Code block to execute if no error occurs
finally:
    # Final code block to execute in all circumstances
```

You must include one or both of the `except` and `finally` clauses. The `else` clause is optional.

The `try` clause contains a code block to execute that might raise an error. If no error occurs, the entire code block executes.
If an error occurs, execution of the code block stops at the point that the error is raised. Your program then continues with the execution of the `except` clause, if provided, or the `finally` clause. If the `except` clause is present, its code block executes only if an error occurred in the `try` clause. The `except` clause "consumes" the error that occurred in the `try` clause. You can then implement any error handling or recovery that's appropriate for your application. If you provide the name of a variable after the `except` keyword, then the `Error` instance is bound to the variable if an error occurs. The `Error` type implements the [`Writable`](/mojo/stdlib/utils/write/Writable) trait, so you can pass it as an argument to the [`print()`](/mojo/stdlib/builtin/io/print) function if you'd like to print its error message to the console. It also implements the [`Stringable`](/mojo/stdlib/builtin/str/Stringable) trait, so you can construct a `String` with `String(error)` if you want to extract the error message as a `String` for further processing. If desired, you can re-raise an error condition from your `except` clause simply by executing a `raise` statement from within its code block. This can be either a new `Error` instance or, if you provided a variable name to capture the `Error` that occurred originally, you can re-raise that error. :::note Because Mojo does not currently support typed errors, a `try-except` control structure can include at most one `except` clause, which catches any `Error` raised. ::: If the `else` clause is present, its code block executes only if an error does not occur in the `try` clause. Note that the `else` clause is *skipped* if the `try` clause executes a `continue`, `break`, or `return` that exits from the `try` block. If the `finally` clause is present, its code block executes after the `try` clause and the `except` or `else` clause, if applicable. The `finally` clause executes even if one of the other code blocks exit by executing a `continue`, `break`, or `return` statement or by raising an error. The `finally` clause is often used to release resources used by the `try` clause (such as a file handle) regardless of whether or not an error occurred. As an example, consider the following program: ```mojo def incr(n: Int) -> Int: if n == Int.MAX: raise "inc: integer overflow" else: return n + 1 def main(): values = List(0, 1, Int.MAX) for value in values: try: print() print("try =>", value[]) if value[] == 1: continue result = StaticString("{} incremented is {}").format(value[], incr(value[])) except e: print("except =>", e) else: print("else =>", result) finally: print("finally => ====================") ``` Running this program generates the following output: ```output try => 0 else => 0 incremented is 1 finally => ==================== try => 1 finally => ==================== try => 9223372036854775807 except => inc: integer overflow finally => ==================== ``` ## Use a context manager A *context manager* is an object that manages resources such as files, network connections, and database connections. It provides a way to allocate resources and release them automatically when they are no longer needed, ensuring proper cleanup and preventing resource leaks even in the case of error conditions. As an example, consider reading data from a file. 
A naive approach might look like this:

```mojo
# Obtain a file handle to read from storage
f = open(input_file, "r")
content = f.read()
# Process the content as needed

# Close the file handle
f.close()
```

Calling [`close()`](/mojo/stdlib/builtin/file/FileHandle#close) releases the memory and other operating system resources associated with the opened file. If your program were to open many files without closing them, you could exhaust the resources available to your program and cause errors. The problem would be even worse if you were writing to a file instead of reading from it, because the operating system might buffer the output in memory until the file is closed. If your program were to crash instead of exiting normally, that buffered data could be lost instead of being written to storage.

The example above actually includes the call to `close()`, but it ignores the possibility that [`read()`](/mojo/stdlib/builtin/file/FileHandle#read) could raise an error, which would result in the program not executing the `close()`. To handle this scenario, you could rewrite the code to use `try` like this:

```mojo
# Obtain a file handle to read from storage
f = open(input_file, "r")
try:
    content = f.read()
    # Process the content as needed
finally:
    # Ensure that the file handle is closed even if read() raises an error
    f.close()
```

However, the [`FileHandle`](/mojo/stdlib/builtin/file/FileHandle) struct returned by [`open()`](/mojo/stdlib/builtin/file/open) is a context manager. When used in conjunction with Mojo's `with` statement, a context manager ensures that the resources it manages are properly released at the end of the block, even if an error occurs. In the case of a `FileHandle`, that means that the call to `close()` takes place automatically. So you could rewrite the example above to take advantage of the context manager—and omit the explicit call to `close()`—like this:

```mojo
with open(input_file, "r") as f:
    content = f.read()
    # Process the content as needed
```

The `with` statement also allows you to use multiple context managers within the same code block. As an example, the following code opens one text file, reads its entire content, converts it to upper case, and then writes the result to a different file:

```mojo
with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
    input_text = f_in.read()
    output_text = input_text.upper()
    f_out.write(output_text)
```

`FileHandle` is perhaps the most commonly used context manager. Other examples of context managers in the Mojo standard library are [`NamedTemporaryFile`](/mojo/stdlib/tempfile/tempfile/NamedTemporaryFile), [`TemporaryDirectory`](/mojo/stdlib/tempfile/tempfile/TemporaryDirectory), [`BlockingScopedLock`](/mojo/stdlib/utils/lock/BlockingScopedLock), and [`assert_raises`](/mojo/stdlib/testing/testing/assert_raises). You can also create your own custom context managers, as described in [Write a custom context manager](#write-a-custom-context-manager) below.

## Write a custom context manager

Writing a custom context manager is a matter of defining a [struct](/mojo/manual/structs) that implements two special *dunder* methods ("double underscore" methods): `__enter__()` and `__exit__()`:

- `__enter__()` is called by the `with` statement to enter the runtime context. The `__enter__()` method should initialize any state necessary for the context and return the context manager.
- `__exit__()` is called when the `with` code block completes execution, even if the `with` code block terminates with a call to `continue`, `break`, or `return`. The `__exit__()` method should release any resources associated with the context. After the `__exit__()` method returns, the context manager is destroyed. If the `with` code block raises an error, then the `__exit__()` method runs before any error processing occurs (that is, before it is caught by a `try-except` structure or your program terminates). If you'd like to define conditional processing for error conditions in a `with` code block, you can implement an overloaded version of `__exit__()` that takes an `Error` argument. For more information, see [Define a conditional `__exit__()` method](#define-a-conditional-__exit__-method) below. For context managers that don't need to release resources or perform other actions on termination, you are not required to implement an `__exit__()` method. In that case the context manager is destroyed automatically after the `with` code block completes execution. Here is an example of implementing a `Timer` context manager, which prints the amount of time spent executing the `with` code block: ```mojo title="context_mgr.mojo" import sys import time @value struct Timer: var start_time: Int fn __init__(out self): self.start_time = 0 fn __enter__(mut self) -> Self: self.start_time = time.perf_counter_ns() return self fn __exit__(mut self): end_time = time.perf_counter_ns() elapsed_time_ms = round(((end_time - self.start_time) / 1e6), 3) print("Elapsed time:", elapsed_time_ms, "milliseconds") def main(): with Timer(): print("Beginning execution") time.sleep(1) if len(sys.argv()) > 1: raise "simulated error" time.sleep(1) print("Ending execution") ``` Running this example produces output like this: ```sh mojo context_mgr.mojo ``` ```output Beginning execution Ending execution Elapsed time: 2010.0 milliseconds ``` ```sh mojo context_mgr.mojo fail ``` ```output Beginning execution Elapsed time: 1002.0 milliseconds Unhandled exception caught during execution: simulated error ``` ### Define a conditional `__exit__()` method When creating a context manager, you can implement the `__exit__(self)` form of the `__exit__()` method to handle completion of the `with` statement under all circumstances including errors. 
However, you have the option of additionally implementing an overloaded version that is invoked instead when an error occurs in the `with` code block:

```mojo
fn __exit__(self, error: Error) raises -> Bool
```

Given the `Error` that occurred as an argument, the method can:

- Return `True` to suppress the error
- Return `False` to re-raise the error
- Raise a new error

The following is an example of a context manager that suppresses only a certain type of error condition and propagates all others:

```mojo title="conditional_context_mgr.mojo"
import sys
import time

@value
struct ConditionalTimer:
    var start_time: Int

    fn __init__(out self):
        self.start_time = 0

    fn __enter__(mut self) -> Self:
        self.start_time = time.perf_counter_ns()
        return self

    fn __exit__(mut self):
        end_time = time.perf_counter_ns()
        elapsed_time_ms = round(((end_time - self.start_time) / 1e6), 3)
        print("Elapsed time:", elapsed_time_ms, "milliseconds")

    fn __exit__(mut self, e: Error) raises -> Bool:
        if String(e) == "just a warning":
            print("Suppressing error:", e)
            self.__exit__()
            return True
        else:
            print("Propagating error")
            self.__exit__()
            return False

def flaky_identity(n: Int) -> Int:
    if (n % 4) == 0:
        raise "really bad"
    elif (n % 2) == 0:
        raise "just a warning"
    else:
        return n

def main():
    for i in range(1, 9):
        with ConditionalTimer():
            print("\nBeginning execution")
            print("i =", i)
            time.sleep(0.1)
            if i == 3:
                print("continue executed")
                continue
            j = flaky_identity(i)
            print("j =", j)
            print("Ending execution")
```

Running this example produces this output:

```output
Beginning execution
i = 1
j = 1
Ending execution
Elapsed time: 105.0 milliseconds

Beginning execution
i = 2
Suppressing error: just a warning
Elapsed time: 106.0 milliseconds

Beginning execution
i = 3
continue executed
Elapsed time: 106.0 milliseconds

Beginning execution
i = 4
Propagating error
Elapsed time: 106.0 milliseconds
Unhandled exception caught during execution: really bad
```

---

## eval_composed

`eval_composed[composed_layout: ComposedLayout[Layout, Swizzle]](idx: UInt, offset: UInt = UInt(0)) -> UInt`

Evaluate a composed layout with swizzle. Evaluates a `ComposedLayout[Layout, Swizzle]`. Applies the base layout, adds an optional offset, and then applies the swizzle.

**Parameters:**

* composed\_layout (`ComposedLayout[Layout, Swizzle]`): The composed layout to evaluate, consisting of a base Layout and a Swizzle transformation.

**Args:**

* idx (`UInt`): The input index to transform.
* offset (`UInt`): Optional offset to apply between layouts (default: 0).

**Returns:**

The transformed index after applying both layouts.

---

## exists

`exists[PathLike: PathLike, //](path: PathLike) -> Bool`

Return True if path exists.

**Parameters:**

* PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* path (`PathLike`): The path to the directory.

**Returns:**

Returns True if the path exists and is not a broken symbolic link.

---

## exit

`exit()`

Exits from Mojo. Unlike the Python implementation, this does not raise an exception to exit.

`exit[intable: Intable](code: intable)`

Exits from Mojo. Unlike the Python implementation, this does not raise an exception to exit.

**Parameters:**

* intable (`Intable`): The type of the exit code.

**Args:**

* code (`intable`): The exit code.

---

## exp

`exp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Calculates elementwise exponential of the input vector.
Given an input vector $X$ and an output vector $Y$, sets $Y_i = e^{X_i}$ for each position $i$ in the input vector (where $e$ is Euler's number, the base of the natural logarithm).

**Constraints:**

The input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): The input SIMD vector.

**Returns:**

A SIMD vector containing $e$ raised to the power $X_i$ where $X_i$ is an element in the input SIMD vector.

`exp[T: _Expable](x: T) -> T`

Computes the exponential of the input value.

**Parameters:**

* T (`_Expable`): The type of the input value.

**Args:**

* x (`T`): The input value.

**Returns:**

The exponential of the input value.

---

## exp2

`exp2[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes elementwise 2 raised to the power of n, where n is an element of the input SIMD vector.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): SIMD vector to perform exp2 on.

**Returns:**

Vector containing $2^n$ computed elementwise, where n is an element in the input SIMD vector.

---

## expand_modes_alike

`expand_modes_alike(shape_a: IntTuple[origin], stride_a: IntTuple[origin], shape_b: IntTuple[origin], stride_b: IntTuple[origin]) -> InlineArray[IntTuple, 3]`

Aligns two shape-stride pairs to have the same hierarchical structure. This function is used to make two layouts compatible for operations by ensuring they have the same hierarchical structure, expanding scalar values into tuples as needed.

**Args:**

* shape\_a (`IntTuple[origin]`): The first shape tuple.
* stride\_a (`IntTuple[origin]`): The first stride tuple.
* shape\_b (`IntTuple[origin]`): The second shape tuple.
* stride\_b (`IntTuple[origin]`): The second stride tuple.

**Returns:**

An array containing three tuples: the common shape, the expanded stride\_a, and the expanded stride\_b.

`expand_modes_alike(layout_a: Layout, layout_b: Layout) -> InlineArray[Layout, 2]`

Aligns two layouts to have the same hierarchical structure. This function tiles both layouts so they mirror each other's structure, making them compatible for operations that require matching hierarchies.

Example: Given layouts with different structures:

* layout\_0: (((3, (5, 2)), 4):((1, (24, 12)), 3))
* layout\_1: ((30, (2, 2)):(2, (60, 1)))

The result would be two layouts with matching structures:

* (((3, (5, 2)), (2, 2)):((1, (24, 12)), (3, 6)))
* (((3, (5, 2)), (2, 2)):((2, (6, 30)), (60, 1)))

```mojo
from layout import Layout, IntTuple
from layout.layout import expand_modes_alike

alias layout_0 = Layout(
    IntTuple(IntTuple(3, IntTuple(5, 2)), 4),
    IntTuple(IntTuple(1, IntTuple(24, 12)), 3),
)
alias layout_1 = Layout(
    IntTuple(30, IntTuple(2, 2)), IntTuple(2, IntTuple(60, 1))
)
alias uc = expand_modes_alike(layout_0, layout_1)
print(uc[0])  # (((3, (5, 2)), (2, 2)):((1, (24, 12)), (3, 6)))
print(uc[1])  # (((3, (5, 2)), (2, 2)):((2, (6, 30)), (60, 1)))
```

**Args:**

* layout\_a (`Layout`): The first layout to align.
* layout\_b (`Layout`): The second layout to align.

**Returns:**

An array containing two layouts with matching hierarchical structures.

---

## expand_strides

`expand_strides(shape: IntTuple[origin], stride: Int) -> IntTuple`

Expands a scalar stride into a stride tuple matching a shape tuple.
This function creates a stride tuple that matches the structure of a shape tuple, with each stride value calculated based on the cumulative product of shape dimensions.

**Args:**

* shape (`IntTuple[origin]`): The shape tuple to match.
* stride (`Int`): The base stride value to expand.

**Returns:**

A stride tuple matching the structure of the shape tuple.

---

## expanduser

`expanduser[PathLike: PathLike, //](path: PathLike) -> String`

Expands a tilde "\~" prefix in `path` to the user's home directory. For example, `~/folder` becomes `/home/current_user/folder`. On macOS and Linux a path starting with `~user/` will expand to the specified user's home directory, so `~user/folder` becomes `/home/user/folder`. If the home directory cannot be determined, or the `path` is not prefixed with "\~", the original path is returned unchanged.

**Parameters:**

* PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* path (`PathLike`): The path that is being expanded.

**Returns:**

The expanded path.

---

## expandvars

`expandvars[PathLike: PathLike, //](path: PathLike) -> String`

Replaces `${var}` or `$var` in the path with values from the current environment variables. Malformed variable names and references to non-existing variables are left unchanged.

**Parameters:**

* PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* path (`PathLike`): The path that is being expanded.

**Returns:**

The expanded path.

---

## expect

`expect[T: AnyTrivialRegType, //, expected_val: T](val: T) -> T`

Provides information about the expected (most probable) value of `val`, which can be used by optimizers.

Notes: Only works with integer/boolean types.

**Parameters:**

* T (`AnyTrivialRegType`): The type of the input value.
* expected\_val (`T`): The expected value of `val`.

**Args:**

* val (`T`): The input value.

**Returns:**

The input value.

---

## ExplicitlyCopyable

The ExplicitlyCopyable trait denotes a type whose value can be copied explicitly. Unlike `Copyable`, which denotes types that are *implicitly* copyable, an explicitly copyable type can only be copied when the explicit copy initializer is called intentionally by the programmer. An explicit copy initializer is just a normal `__init__` method that takes a `read-only` argument of `Self`.

Example implementing the `ExplicitlyCopyable` trait on `Foo`, which requires the `fn copy(self) -> Self` method:

```mojo
struct Foo(ExplicitlyCopyable):
    var s: String

    @implicit
    fn __init__(out self, s: String):
        self.s = s

    fn copy(self) -> Self:
        print("explicitly copying value")
        return Foo(self.s)
```

You can now copy objects inside a generic function:

```mojo
fn copy_return[T: ExplicitlyCopyable](foo: T) -> T:
    var copy = foo.copy()
    return copy

var foo = Foo("test")
var res = copy_return(foo)
```

```plaintext
explicitly copying value
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `copy`

`copy(self: _Self) -> _Self`

Explicitly construct a copy of self.

**Returns:**

A copy of this value.

---

## expm1

`expm1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the `expm1` of the inputs.

**Constraints:**

The input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): The input argument.

**Returns:**

The `expm1` of the input.
---

## extend_shape

`extend_shape[rank: Int](in_shape: IndexList[rank], first: Int, last: Int) -> IndexList[(rank + 2)]`

Extend input shape by inserting `first` and `last` at both ends.

---

## external_call

`external_call[callee: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType, *types: AnyType](*args: *types) -> return_type`

Calls an external function.

**Parameters:**

* callee (`StringSlice[StaticConstantOrigin]`): The name of the external function.
* return\_type (`AnyTrivialRegType`): The return type.
* \*types (`AnyType`): The argument types.

**Args:**

* \*args (`*types`): The arguments to pass to the external function.

**Returns:**

The external call result.

`external_call[callee: StringSlice[StaticConstantOrigin], return_type: AnyTrivialRegType](args: VariadicPack[is_owned, origin, AnyType, element_types]) -> return_type`

Calls an external function.

**Parameters:**

* callee (`StringSlice[StaticConstantOrigin]`): The name of the external function.
* return\_type (`AnyTrivialRegType`): The return type.

**Args:**

* args (`VariadicPack[is_owned, origin, AnyType, element_types]`): The arguments to pass to the external function.

**Returns:**

The external call result.
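For example, here's a minimal sketch that calls the C library's `getpid()` function (this assumes a POSIX host where `getpid` is available in the linked libc):

```mojo
from sys.ffi import external_call

fn main():
    # getpid() takes no arguments and returns the process ID as a C int.
    var pid = external_call["getpid", Int32]()
    print("pid:", pid)
```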
---

## external_memory

`external_memory[type: AnyTrivialRegType, *, address_space: AddressSpace, alignment: Int, name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("extern_ptr_syml")]() -> UnsafePointer[type, address_space=address_space, alignment=alignment]`

Gets a pointer to dynamically allocated external memory.

This function returns a pointer to external memory that can be used for dynamic shared memory allocations in GPU kernels. The memory is allocated in the specified address space with the given alignment requirements.

Note:

* The memory is not initialized and must be explicitly written before reading.
* The allocation size is determined at kernel launch time.
* The pointer is only valid within the GPU kernel execution context.
* Care must be taken to respect alignment requirements when accessing the memory.

**Parameters:**

* type (`AnyTrivialRegType`): The type of elements stored in the memory. Must be a trivial register type.
* address\_space (`AddressSpace`): The memory address space to allocate in (e.g. shared, global).
* alignment (`Int`): The minimum alignment requirement in bytes for the allocated memory.
* name (`StringSlice[StaticConstantOrigin]`): Optional symbolic name for the external memory allocation. Defaults to "extern\_ptr\_syml".

**Returns:**

A properly aligned pointer to the allocated external memory in the specified address space.

---

## extrx

`extrx(gpr: Int)`

Extracts a row or moves it to x, result in amx0.

---

## extry

`extry(gpr: Int)`

Extracts a row or moves it to y, result in amx0.

---

## factorial

`factorial(n: Int) -> Int`

Computes the factorial of the integer.

**Args:**

* n (`Int`): The input value. Must be non-negative.

**Returns:**

The factorial of the input. Results are undefined for negative inputs.

---

## FAQ

If this page doesn't answer your question, please ask us on our [Modular forum](https://forum.modular.com) or [Discord channel](https://www.discord.gg/modular).

## Distribution

### What are the system requirements? {#system-requirements}

**macOS:**

- macOS Ventura (13) or later
- Apple silicon (M1/M2/M3/M4 processor)
- Python 3.9 - 3.13
- Xcode or Xcode Command Line Tools
- We currently don't support Mac GPUs

**Linux:**

- Ubuntu 22.04 LTS
- x86-64 CPU (with [SSE4.2 or newer](https://www.intel.com/content/www/us/en/support/articles/000057621/processors.html)) or AWS Graviton2/3 CPU
- Minimum 8 GiB RAM (or much more, depending on the model you run)
- Python 3.9 - 3.13
- g++ or clang++ C++ compiler
- To use GPUs, see the [GPU requirements](#gpu-requirements)

Windows is not officially supported at this time. In the meantime, you can try MAX on Windows [with WSL](https://learn.microsoft.com/en-us/windows/wsl/install), using a compatible version of Ubuntu (see our requirements for Linux).

### What are the GPU requirements? {#gpu-requirements}

The Modular Platform supports both CPUs and GPUs, so you don't need a GPU to serve a model or program with Mojo. But if you do want to accelerate your model with GPUs or program for GPUs with Mojo, Modular supports many GPU types.

Because we don't test every variant of a GPU architecture, and support for new architectures will improve incrementally, we've divided our list of compatible GPUs into three tiers:

#### Tier 1: Fully supported

We provide full support and testing for the following data center GPUs:

- NVIDIA H100 and H200 (Hopper)
- NVIDIA A100 and A10 (Ampere)
- NVIDIA L4 and L40 (Ada Lovelace)

#### Tier 2: Confirmed compatibility

We've confirmed compatibility with the following GPUs but we currently don't maintain tests for them:

- NVIDIA RTX 40XX series (Ada Lovelace)
- NVIDIA RTX 30XX series (Ampere)

#### Tier 3: Limited compatibility

We've either confirmed or received reports that the following GPUs work for GPU programming with Mojo and can execute basic graphs with MAX APIs. However, these GPUs currently can't run some GenAI models for various reasons:

- NVIDIA RTX 20XX series (Turing)
- NVIDIA T4 (Turing)
- NVIDIA Jetson Orin and Orin Nano (Ampere)

If you've had success with any GPUs not listed here, please [let us know on Discord](https://discord.gg/modular).

#### Software requirements

- NVIDIA GPU driver version 550 or higher. You can check your NVIDIA GPU driver version using [nvidia-smi](https://developer.nvidia.com/system-management-interface). To update, see the [NVIDIA driver docs](https://www.nvidia.com/en-us/drivers/).

:::note Notes
- Many GPUs are available in variants with different amounts of memory, and each AI model has different memory requirements. So even if your GPU architecture is listed as compatible, you must confirm that the available memory is sufficient for the model you're using.
- Modular can serve lots of models on either CPU or GPU, but some models do require one or more GPUs. When you browse our [model repository](https://builds.modular.com/?category=models), you can filter by models that support either CPU or GPU.
:::

### Why bundle Mojo with MAX?

Integrating Mojo and MAX into a single package is the best way to ensure interoperability between Mojo and MAX for all users, and avoid version conflicts that happen when installing them separately.

Moreover, we built Mojo as a [core technology for MAX](/mojo/why-mojo), and you can use it to [extend MAX Engine](/max/custom-ops), so MAX clearly depends on Mojo. On the other hand, writing Mojo code that runs on both CPUs and GPUs (and other accelerators) requires runtime components and orchestration logic that falls outside the domain of Mojo, and into the domain of MAX.
That is, MAX isn't just a framework for AI development, it's also a framework for general heterogeneous compute. As such, writing Mojo programs that can execute across heterogeneous hardware depends on MAX. Nothing has changed for Mojo developers—you can still build and develop in Mojo like you always have. The only difference is that you're now able to seamlessly step into general-purpose GPU programming (coming soon). ### Will MAX be open-sourced? We want to contribute a lot to open source, but we also want to do it right. Our team has decades of experience building open-source projects, and we believe it's very important to create an inclusive and vibrant community, which takes a lot of work. We've already begun open-sourcing parts of the MAX framework, including our [Python serving library](https://github.com/modular/modular/tree/main/max/serve), [MAX model architectures](https://github.com/modular/modular/tree/main/max/pipelines/architectures), and [GPU kernels](https://github.com/modular/modular/tree/main/max/kernels/src/nn). To get the latest updates, [sign up for our newsletter](https://www.modular.com/modverse#signup). ## Functionality ### What hardware does MAX support? MAX supports a broad range of CPUs, including Intel, AMD, and ARM variants, as well as GPUs from NVIDIA and AMD (coming soon). For more specifics, see the above [system requirements](#system-requirements). ### What clouds and services can I deploy MAX onto? You can deploy our MAX container across a variety of VM and Kubernetes-based cloud services, including AWS, GCP, and Azure. To get started with any of them, check out our [tutorials using MAX Serve](/max/tutorials?filterByTags&tag=serve). ### Can I run MAX locally? Yes. MAX supports macOS and ARM hardware, meaning it can run on your local laptop for exploration and testing purposes. ### Will MAX support distributed inference of large models? Yes, it will support executing large models that do not fit into the memory of a single device. This isn't available yet, so stay tuned! ## Installation ### Can I install both stable and nightly builds? Yes, it's safe and easy to use the stable and nightly builds for different projects, each with its own virtual environment and package dependencies. For more information, read the [Install guide](/max/packages). ### Does the MAX SDK collect telemetry? Yes, the MAX SDK collects basic system information, session durations, compiler events, and crash reports that enable us to identify, analyze, and prioritize issues. The MAX container for model serving also collects performance metrics such as time to first token and input processing time. This telemetry is crucial to help us quickly identify problems and improve our products for you. Without this telemetry, we would rely solely on user-submitted bug reports, which are limited and would severely limit our performance insights. You can opt out of some telemetry, such as compiler events and crash reports. However, package install/update/uninstall events, basic system information, and session durations (the amount of time spent running MAX Engine) cannot be disabled (see the [Terms of use](https://www.modular.com/legal/terms)). To disable telemetry for compiler events and crash reports, run this command in your project environment (you must run this for each project):

```sh
magic telemetry --disable
```

To disable serving telemetry, see the [MAX container documentation](/max/container#metrics). --- ## fast_div Implements the fast division algorithm.
This method replaces division by constants with a sequence of shifts and multiplications, significantly optimizing division performance. ## Structs * [​`FastDiv`](./FastDiv): Implements fast division for a given type. --- ## FastDiv `@register_passable(trivial)` `struct FastDiv[type: DType]` Implements fast division for a given type. This struct provides optimized division by a constant divisor, replacing the division operation with a series of shifts and multiplications. This approach significantly improves performance, especially in scenarios where division is a frequent operation. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `uint_type` `alias uint_type = _uint_type_of_width[::Int]()` ## Methods ### `__init__` `@implicit` `__init__(divisor: Int = 1) -> Self` Initializes FastDiv with the divisor. **Constraints:** ConstraintError: If the bitwidth of the type is > 32. **Args:** * divisor (`Int`): The divisor to use for fast division. Defaults to 1. ### `__rtruediv__` `__rtruediv__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Divides the other scalar by the divisor (true division). Uses the fast division algorithm. **Args:** * other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The result of the division. ### `__rmod__` `__rmod__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Computes the remainder of division. **Args:** * other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The remainder. ### `__rdiv__` `__rdiv__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> SIMD[_uint_type_of_width[::Int](), 1]` Divides the other scalar by the divisor. **Args:** * other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** The result of the division. ### `__divmod__` `__divmod__(self, other: SIMD[_uint_type_of_width[::Int](), 1]) -> Tuple[SIMD[_uint_type_of_width[::Int](), 1], SIMD[_uint_type_of_width[::Int](), 1]]` Computes both quotient and remainder. **Args:** * other (`SIMD[_uint_type_of_width[::Int](), 1]`): The dividend. **Returns:** A tuple containing the quotient and remainder. --- ## Featured tutorials

**Featured:** `start-a-chat-endpoint`, `max-serve-local-to-cloud`, `deploy-max-serve-on-kubernetes`

**New:** `run-embeddings-with-max-serve`, `build-custom-ops`

**Popular:** `max-pipeline-bring-your-own-model`, `deploy-serverless-cloud-run`, `get-started-with-max-graph-in-python`

--- ## fence_mbarrier_init `fence_mbarrier_init()` Creates a memory fence after mbarrier initialization. This function establishes a memory barrier that ensures the proper initialization of memory barriers (mbarrier) before they are used. It guarantees that the mbarrier initialization is complete and visible to all threads before subsequent operations. Note: Should be called immediately after mbarrier initialization to ensure proper synchronization semantics. --- ## fence_proxy_tensormap_generic_sys_acquire `fence_proxy_tensormap_generic_sys_acquire[type: AnyType](ptr: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin], size: SIMD[int32, 1])` Acquires a system-wide memory fence for tensor map operations. This function establishes a memory fence that ensures proper synchronization between tensor map operations and system memory.
It guarantees that all previous memory operations are completed before subsequent tensor map accesses. Note: This is a low-level synchronization primitive typically used in conjunction with TMA (Tensor Memory Access) operations on NVIDIA GPUs. **Parameters:** * type (`AnyType`): The data type of the tensor map object being synchronized. **Args:** * ptr (`UnsafePointer[type, alignment=alignment, mut=mut, origin=origin]`): Pointer to the tensor map object in system memory that needs to be synchronized. * size (`SIMD[int32, 1]`): The size in bytes of the tensor map object being synchronized. --- ## fence_proxy_tensormap_generic_sys_release `fence_proxy_tensormap_generic_sys_release()` Releases the system-wide memory fence for tensor map operations. This function releases the memory fence previously established by the acquire operation. It ensures that all tensor map operations are completed and visible to the system before proceeding. Note: Should be called after tensor map operations are complete to maintain proper memory ordering semantics. --- ## ffi Implements a foreign function interface (FFI). ## Aliases ### `c_char` `alias c_char = SIMD[int8, 1]` C `char` type. ### `c_double` `alias c_double = SIMD[float64, 1]` C `double` type. ### `c_float` `alias c_float = SIMD[float32, 1]` C `float` type. ### `c_int` `alias c_int = SIMD[int32, 1]` C `int` type. The C `int` type is typically a signed 32-bit integer on commonly used targets today. ### `c_long` `alias c_long = SIMD[_c_long_dtype(), 1]` C `long` type. The C `long` type is typically a signed 64-bit integer on macOS and Linux, and a 32-bit integer on Windows. ### `c_long_long` `alias c_long_long = SIMD[_c_long_long_dtype(), 1]` C `long long` type. The C `long long` type is typically a signed 64-bit integer on commonly used targets today. ### `c_short` `alias c_short = SIMD[int16, 1]` C `short` type. ### `c_size_t` `alias c_size_t = UInt` C `size_t` type. ### `c_ssize_t` `alias c_ssize_t = Int` C `ssize_t` type. ### `c_uchar` `alias c_uchar = SIMD[uint8, 1]` C `unsigned char` type. ### `c_uint` `alias c_uint = SIMD[uint32, 1]` C `unsigned int` type. ### `c_ushort` `alias c_ushort = SIMD[uint16, 1]` C `unsigned short` type. ### `DEFAULT_RTLD` `alias DEFAULT_RTLD = (256 if os_is_linux() else 8 | 2)` ### `OpaquePointer` `alias OpaquePointer = UnsafePointer[NoneType]` An opaque pointer, equivalent to the C `void*` type. ## Structs * [​`DLHandle`](/mojo/stdlib/sys/ffi/DLHandle): Represents a dynamically linked library that can be loaded and unloaded. * [​`RTLD`](/mojo/stdlib/sys/ffi/RTLD): Enumeration of the RTLD flags used during dynamic library loading. ## Functions * [​`external_call`](/mojo/stdlib/sys/ffi/external_call): Calls an external function. --- ## file Provides APIs to read and write files. These are Mojo built-ins, so you don't need to import them. For example, here's how to read a file:

```mojo
var f = open("my_file.txt", "r")
print(f.read())
f.close()
```

Or use a `with` statement to close the file automatically:

```mojo
with open("my_file.txt", "r") as f:
    print(f.read())
```

## Structs * [​`FileHandle`](/mojo/stdlib/builtin/file/FileHandle): File handle to an opened file. ## Functions * [​`open`](/mojo/stdlib/builtin/file/open): Opens the file specified by path using the mode provided, returning a FileHandle. --- ## file_descriptor Higher-level abstraction for file streams. These are Mojo built-ins, so you don't need to import them.
For example, here's how to print to a file (note that the file must be opened for writing, and the handle is transferred into the `file` argument):

```mojo
var f = open("my_file.txt", "w")
print("hello", file=f^)
```

## Structs * [​`FileDescriptor`](/mojo/stdlib/builtin/file_descriptor/FileDescriptor): File descriptor of a file. --- ## FileDescriptor `@register_passable(trivial)` `struct FileDescriptor` File descriptor of a file. ## Fields * value (`Int`): The underlying value of the file descriptor. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writer` ## Methods ### `__init__` `__init__(value: Int = 1) -> Self` Constructs the file descriptor from an integer. **Args:** * value (`Int`): The file identifier (Default 1 = stdout). `@implicit` `__init__(f: FileHandle) -> Self` Constructs the file descriptor from a file handle. **Args:** * f (`FileHandle`): The file handle. ### `__write_bytes_cpu` `__write_bytes_cpu(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `read_bytes` `read_bytes(mut self, buffer: Span[SIMD[uint8, 1], origin]) -> UInt` Read a number of bytes from the file into a buffer. Notes: [Reference](https://pubs.opengroup.org/onlinepubs/9799919799/functions/read.html). **Args:** * buffer (`Span[SIMD[uint8, 1], origin]`): A `Span[Byte]` to read bytes into. Read up to `len(buffer)` number of bytes. **Returns:** Actual number of bytes read. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * \*Ts (`Writable`): Types of the provided argument sequence. **Args:** * \*args (`*Ts`): Sequence of arguments to write to this Writer. --- ## FileHandle `struct FileHandle` File handle to an opened file. ## Fields * handle (`UnsafePointer[NoneType]`): The underlying pointer to the file handle. ## Implemented traits `AnyType`, `UnknownDestructibility`, `Writer` ## Methods ### `__init__` `__init__(out self)` Default constructor. `__init__(out self, path: StringSlice[origin], mode: StringSlice[origin])` Construct the FileHandle using the file path and mode. **Args:** * path (`StringSlice[origin]`): The file path. * mode (`StringSlice[origin]`): The mode to open the file in (the mode can be "r" or "w" or "rw"). ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Moves constructor for the file handle. **Args:** * existing (`Self`): The existing file handle. ### `__del__` `__del__(owned self)` Closes the file handle. ### `close` `close(mut self)` Closes the file handle. ### `read` `read(self, size: Int = -1) -> String` Reads data from a file and sets the file handle seek position. If size is left as the default of -1, it will read to the end of the file. Setting size to a number larger than what's in the file will set the String length to the total number of bytes, and read all the data.
Examples: Read the entire file into a String:

```mojo
var file = open("/tmp/example.txt", "r")
var string = file.read()
print(string)
```

Read the first 8 bytes, skip 2 bytes, and then read the next 8 bytes:

```mojo
import os
var file = open("/tmp/example.txt", "r")
var word1 = file.read(8)
print(word1)
_ = file.seek(2, os.SEEK_CUR)
var word2 = file.read(8)
print(word2)
```

Read the last 8 bytes in the file, then the first 8 bytes:

```mojo
import os
var file = open("/tmp/example.txt", "r")
_ = file.seek(-8, os.SEEK_END)
var last_word = file.read(8)
print(last_word)
_ = file.seek(0, os.SEEK_SET)  # os.SEEK_SET is the default start of file
var first_word = file.read(8)
print(first_word)
```

**Args:** * size (`Int`): Requested number of bytes to read (Default: -1 = EOF). **Returns:** The contents of the file. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. `read[dtype: DType, origin: MutableOrigin](self, buffer: Span[SIMD[dtype, 1], origin]) -> Int` Read data from the file into the Span. This will read n bytes from the file into the input Span, where `0 <= n <= len(buffer)`. **Parameters:** * dtype (`DType`): The type that the data will be represented as. * origin (`MutableOrigin`): The origin of the passed in Span. **Args:** * buffer (`Span[SIMD[dtype, 1], origin]`): The mutable Span to read data into. **Returns:** The total amount of data that was read in bytes. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. ### `read_bytes` `read_bytes(self, size: Int = -1) -> List[SIMD[uint8, 1]]` Reads data from a file and sets the file handle seek position. If size is left as the default of -1, it will read to the end of the file. Setting size to a number larger than what's in the file will be handled and set the List length to the total number of bytes in the file. Examples: Reading the entire file into a List\[Int8]:

```mojo
var file = open("/tmp/example.txt", "r")
var data = file.read_bytes()
```

Reading the first 8 bytes, skipping 2 bytes, and then reading the next 8 bytes:

```mojo
import os
var file = open("/tmp/example.txt", "r")
var list1 = file.read_bytes(8)
_ = file.seek(2, os.SEEK_CUR)
var list2 = file.read_bytes(8)
```

Reading the last 8 bytes in the file, then the first 8 bytes:

```mojo
import os
var file = open("/tmp/example.txt", "r")
_ = file.seek(-8, os.SEEK_END)
var last_data = file.read_bytes(8)
_ = file.seek(0, os.SEEK_SET)  # os.SEEK_SET is the default start of file
var first_data = file.read_bytes(8)
```

**Args:** * size (`Int`): Requested number of bytes to read (Default: -1 = EOF). **Returns:** The contents of the file. **Raises:** An error if this file handle is invalid, or if the file read returned a failure. ### `seek` `seek(self, offset: SIMD[uint64, 1], whence: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[uint64, 1]` Seeks to the given offset in the file. Examples: Skip 32 bytes from the current read position:

```mojo
import os
var f = open("/tmp/example.txt", "r")
_ = f.seek(32, os.SEEK_CUR)
```

Start from 32 bytes from the end of the file:

```mojo
import os
var f = open("/tmp/example.txt", "r")
_ = f.seek(-32, os.SEEK_END)
```

**Args:** * offset (`SIMD[uint64, 1]`): The byte offset to seek to. * whence (`SIMD[uint8, 1]`): The reference point for the offset: os.SEEK\_SET = 0: start of file (Default). os.SEEK\_CUR = 1: current position. os.SEEK\_END = 2: end of file. **Returns:** The resulting byte offset from the start of the file. **Raises:** An error if this file handle is invalid, or if file seek returned a failure.
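The Span-based `read()` overload above has no example, so here is a minimal sketch of reading into a preallocated buffer. The buffer construction (`List(length=..., fill=...)`) and the explicit `Span` import are illustrative assumptions, not part of the `FileHandle` API documented here:

```mojo
from memory import Span

fn read_into_buffer() raises:
    var file = open("/tmp/example.txt", "r")
    # Preallocate 16 bytes; read() fills at most len(buffer) bytes
    # and returns the number of bytes actually read.
    var buf = List[UInt8](length=16, fill=0)
    var bytes_read = file.read(Span(buf))
    print("read", bytes_read, "bytes")
    file.close()
```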
### `write_bytes` `write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])` Write a span of bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file. ### `write` `write[*Ts: Writable](mut self, *args: *Ts)` Write a sequence of Writable arguments to the provided Writer. **Parameters:** * ​\*Ts (`Writable`): Types of the provided argument sequence. **Args:** * ​\*args (`*Ts`): Sequence of arguments to write to this Writer. ### `__enter__` `__enter__(owned self) -> Self` The function to call when entering the context. **Returns:** The file handle. --- ## Fill `@register_passable(trivial)` `struct Fill` Represents memory fill patterns for GPU memory operations. This struct defines different fill patterns that can be used when allocating or initializing GPU memory. The patterns control how memory is initialized, which can be important for debugging and performance optimization. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `NAN` `alias NAN = Fill(2)` Fill memory with NaN values. Useful for debugging floating point computations. ### `NONE` `alias NONE = Fill(0)` No fill pattern - memory is left uninitialized. ### `ZERO` `alias ZERO = Fill(1)` Fill memory with zeros. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Tests if two Fill instances have the same fill pattern. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Tests if two Fill instances have different fill patterns. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Tests if two Fill instances are identical. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are identical, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Tests if two Fill instances are not identical. **Args:** * ​other (`Self`): The Fill instance to compare against. **Returns:** True if the fill patterns are not identical, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the fill pattern. Converts the fill pattern into a human-readable string for debugging and display purposes. **Returns:** A string describing the fill pattern. --- ## fill_like `fill_like(src: IntTuple[origin], val: Int) -> IntTuple` Creates an `IntTuple` with the same structure as the source but filled with a specified value. This function recursively traverses the source `IntTuple` and creates a new `IntTuple` with identical structure, but with all leaf values replaced by the specified value. **Args:** * ​src (`IntTuple[origin]`): The source `IntTuple` whose structure will be copied. * ​val (`Int`): The integer value to fill the new `IntTuple` with. **Returns:** A new `IntTuple` with the same structure as src but filled with val. 
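As a quick illustration of `fill_like`, here is a sketch. The `layout.int_tuple` import path and the variadic `IntTuple` construction are assumptions based on how `IntTuple` is used elsewhere in these docs:

```mojo
from layout.int_tuple import IntTuple, fill_like

fn main():
    # A nested tuple: ((2, 3), 4)
    var src = IntTuple(IntTuple(2, 3), 4)
    # Same nesting, but every leaf replaced by 1: ((1, 1), 1)
    var ones = fill_like(src, 1)
    print(ones)
```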
--- ## flare_mla_decoding `flare_mla_decoding[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` MLA decoding kernel that is only called in the optimized compute graph. The Q input has a shape of \[seq\_len, num\_heads, depth]. The K input has a shape of \[seq\_len, 1, depth]. The V tensor is derived by reusing K, where V = K\[:, :, :depth\_v]. Specifically, for DeepSeek V2/3, depth = 576 and depth\_v = 512. This kernel computes attention without needing to load V twice. This kernel only handles decoding requests. In this case q\_max\_seq\_len = 1. This kernel handles batches with different valid lengths (i.e., before the padding). Such lengths are passed in the `valid_length` argument.
`flare_mla_decoding[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, scale: SIMD[float32, 1], ctx: DeviceContext, num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flare_mla_decoding_dispatch `flare_mla_decoding_dispatch[rank: Int, k_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, _is_cache_length_accurate: Bool = False, _use_valid_length: Bool = True, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], max_prompt_len: Int, max_cache_valid_length: Int, scale: SIMD[float32, 1], ctx: DeviceContext, kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flare_mla_prefill `flare_mla_prefill[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, softmax_type: DType, q_shape: DimList, //, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](output: NDBuffer[output_type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], k_rope: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], cache_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), prev_output: OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]] = OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}))` MLA 
prefill kernel that is only called in the optimized compute graph. Only supports ragged Q/K/V inputs. The Q input has a shape of \[seq\_len, num\_heads, q\_depth]. The K and V inputs have a shape of \[cache\_len, num\_heads, depth]. The K\_rope input is retrieved from the KV cache, with a shape of \[cache\_len, 1, q\_depth - depth]. Specifically, for DeepSeek V2/3, depth = 128 and q\_depth = 192. When computing attention scores (Q @ K), each K head is smaller than each Q head. The missing 64 elements of K are retrieved from the K cache and broadcast to all the heads. This kernel also handles the case where the output has a reduced dimension compared to the input Q. This kernel handles batches with different valid lengths (i.e., before the padding). Such lengths are passed in the `valid_length` argument. `flare_mla_prefill[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, softmax_type: DType, q_shape: DimList, //, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], k_rope: NDBuffer[type, 4, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], cache_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}))` --- ## flare_mla_prefill_dispatch `flare_mla_prefill_dispatch[rank: Int, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, softmax_type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False, q_depth: Int = 192, cache_depth: Int = 576, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), _ndbuffer_mha_operand: Bool = False](output: NDBuffer[output_type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, v: v_t, k_rope: k_rope_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: NDBuffer[uint32, 1, origin, shape, strides], max_prompt_len: Int, scale: SIMD[float32, 1], ctx: DeviceContext, softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}), cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), prev_output: OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]] = OptionalReg[NDBuffer[output_type, rank, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3,
MutableAnyOrigin]]({:i1 0, 1}))` --- ## Flash attention Flash attention is an optimization technique to compute attention blocks in [transformer](transformer.mdx) models. Traditional [attention](attention.mdx) requires storing large intermediate activation tensors, leading to high memory overhead that slows execution because it requires frequent memory transfers between high-bandwidth memory (HBM) and faster SRAM on the GPU. Flash attention improves performance and reduces the memory footprint for attention layers by reordering computations with techniques such as tiling to compute attention scores in blocks, and keeping only small chunks of activations in the faster on-chip SRAM. This allows the model to process much longer sequences without running into memory limitations. By improving the efficiency of attention layers, flash attention enables LLMs to handle much longer contexts, improving their ability to understand and generate complex text. --- ## flash_attention `flash_attention[type: DType, rank: Int, mask_rank: Int, //, input_k_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_mask_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](q: NDBuffer[type, rank, origin, shape, strides], k_shape: IndexList[rank], v_shape: IndexList[rank], mask_shape: IndexList[mask_rank], output: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1])` --- ## flash_attention ## Functions * [​`flash_attention`](./flash_attention): * [​`flash_attention_kv_cache`](./flash_attention_kv_cache): * [​`flash_attention_split_kv`](./flash_attention_split_kv): Variant of flash attention that takes the previous KV cache `input_{k,v}_cache_fn` and the current KV tensors `input_k_fn` and `input_v_fn` as separate arguments. 
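The kernels below all build on the streaming ("online") softmax bookkeeping sketched in the Flash attention note above. For reference, this is the standard recurrence (general flash-attention math, not specific to this API): scores are processed one tile at a time while maintaining a running maximum $m$, a running normalizer $\ell$, and an unnormalized output accumulator $o$:

$$m^{(t)} = \max\bigl(m^{(t-1)},\ \max_j s_j^{(t)}\bigr)$$

$$\ell^{(t)} = \ell^{(t-1)}\, e^{m^{(t-1)} - m^{(t)}} + \sum_j e^{s_j^{(t)} - m^{(t)}}$$

$$o^{(t)} = o^{(t-1)}\, e^{m^{(t-1)} - m^{(t)}} + \sum_j e^{s_j^{(t)} - m^{(t)}}\, v_j$$

The final output $o^{(T)} / \ell^{(T)}$ equals $\mathrm{softmax}(s)\,V$ computed in a single pass, which is why the full score matrix never needs to be materialized in memory.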
--- ## flash_attention `flash_attention[rank: Int, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False, naive_kernel: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], v: NDBuffer[type, rank, origin, shape, strides], mask: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], scale: SIMD[float32, 1], context: DeviceContextPtr = DeviceContextPtr(), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` `flash_attention[rank: Int, cache_t: KVCacheT, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, decoding_warp_split_k: Bool = False, naive_kernel: Bool = False, assert_write_mode: Int = 0](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: cache_t, v: cache_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], ctx: DeviceContext, q_max_seq_len: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` Flash attention 2 algorithm. Compute:

1. Transpose (Q) BSHD -> BHSD;
2. Transpose (K) BSHD -> BHSD;
3. Transpose (V) BSHD -> BHSD;
4. P = Bmm(Q, K), P is also called "score";
5. P = P \* scale + mask;
6. P = softmax(P);
7. O = Bmm(P, V);
8. Output = Transpose(O).

B, S, H, D denote batch size, sequence length, head count, and depth, respectively. Steps (1), (2), and (3) happen while loading the data into shared memory. Step (8) happens when writing output to global memory. All inputs (query, key, and value) must have BSHD layout. The mask can be BSS or BHSS. This kernel also handles grouped attention optimization. In this case the shape of K and V are BShD where h = H / num\_groups. This kernel handles batches with different valid lengths (i.e., before the padding). Such lengths are passed in the `valid_length` argument.
`flash_attention[rank: Int, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), decoding_warp_split_k: Bool = False, naive_kernel: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: NDBuffer[type, rank, origin, shape, strides], v: NDBuffer[type, rank, origin, shape, strides], mask_functor: mask_t, score_mod_functor: score_mod_t, scale: SIMD[float32, 1], ctx: DeviceContext, num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flash_attention_dispatch `flash_attention_dispatch[rank: Int, k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, q_shape: DimList, //, kv_num_heads: Int, use_score_mod: Bool = False, config: MHAConfig = MHAConfig(type, UInt(q_shape.get[::Int]()), UInt(q_shape.get[::Int]()), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), OptionalReg[UInt]({:i1 0, 1}), UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), UInt(1), FlashAttentionAlgorithm()), ragged: Bool = False, _is_flash_attention_applicable: Bool = True, _is_cache_length_accurate: Bool = False, _use_valid_length: Bool = True, decoding_warp_split_k: Bool = False](output: NDBuffer[type, rank, origin, shape, strides], q: NDBuffer[type, rank, origin, q_shape, strides], k: k_t, v: v_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], max_prompt_len: Int, max_cache_valid_length: Int, scale: SIMD[float32, 1], is_token_generation: Bool, ctx: DeviceContext, kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1}), num_partitions: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## flash_attention_hw_supported `flash_attention_hw_supported[qkv_type: DType]() -> Bool` --- ## flash_attention_kv_cache `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, //](q: NDBuffer[type, 4, origin, shape, strides], k: cache_t, v: cache_t, mask: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides])` `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, mask_t: MHAMask, //](q: NDBuffer[type, 4, origin, shape, strides], k: cache_t, v: cache_t, mask: mask_t, scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides])` `flash_attention_kv_cache[type: DType, cache_t: KVCacheT, mask_t: MHAMask, //](q: NDBuffer[type, 3, origin, shape, strides], q_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], k: cache_t, v: cache_t, mask: mask_t, scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides])` Entrypoint for ragged tensors. 
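To make the ragged convention concrete: a ragged batch packs all sequences into one tensor of shape \[total\_tokens, num\_heads, depth], and the `*_input_row_offsets` arguments describe where each sequence starts. The layout below is an inferred illustration, not a documented contract: an offsets array of length `batch_size + 1` where sequence `i` occupies rows `[offsets[i], offsets[i+1])`.

```mojo
fn main():
    # Three packed sequences of lengths 5, 2, and 7: offsets are the
    # cumulative row starts, with a final entry for the total row count.
    var offsets = List[UInt32](0, 5, 7, 14)
    for i in range(len(offsets) - 1):
        print("sequence", i, "spans rows", offsets[i], "to", offsets[i + 1])
```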
--- ## flash_attention_split_kv `flash_attention_split_kv[type: DType, rank: Int, mask_rank: Int, //, input_k_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_k_cache_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_v_cache_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_mask_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](q: NDBuffer[type, rank, origin, shape, strides], k_shape: IndexList[rank], v_shape: IndexList[rank], k_cache_shape: IndexList[(rank + 1)], v_cache_shape: IndexList[(rank + 1)], mask_shape: IndexList[mask_rank], output: NDBuffer[type, rank, origin, shape, strides], scale: SIMD[float32, 1])` Variant of flash attention that takes the previous KV cache `input_{k,v}_cache_fn` and the current KV tensors `input_k_fn` and `input_v_fn` as separate arguments. This works around the fact that fusion can't currently look through concat. So this kernel does an in-place concat fusion by changing the input lambdas `input_{k,v}_cache_fn_wrapper` to take previous sequence KV elements from the KV cache, and current KV elements from tensors `k` and `v`. --- ## FlashAttentionAlgorithm `@register_passable(trivial)` `struct FlashAttentionAlgorithm` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `FLASH_ATTENTION_1` `alias FLASH_ATTENTION_1 = FlashAttentionAlgorithm(1)` ### `FLASH_ATTENTION_2` `alias FLASH_ATTENTION_2 = FlashAttentionAlgorithm(2)` ### `FLASH_ATTENTION_3` `alias FLASH_ATTENTION_3 = FlashAttentionAlgorithm(3)` ### `NAIVE` `alias NAIVE = FlashAttentionAlgorithm(0)` ## Methods ### `__init__` `__init__() -> Self` `@implicit` `__init__(value: Int) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## flatten `flatten(t: IntTuple[origin]) -> IntTuple` Flatten a nested `IntTuple` into a single-level `IntTuple`. This function converts a hierarchical `IntTuple` structure into a flat sequence of integer values, preserving the order of elements. **Args:** * ​t (`IntTuple[origin]`): The nested `IntTuple` to flatten. **Returns:** A new `IntTuple` containing all integer values in a flat structure. --- ## float_literal Implements the FloatLiteral class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`FloatLiteral`](/mojo/stdlib/builtin/float_literal/FloatLiteral): Mojo floating point literal type. --- ## floatable Implements the `Floatable` and `FloatableRaising` traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Floatable`](/mojo/stdlib/builtin/floatable/Floatable): The `Floatable` trait describes a type that can be converted to a Float64. * [​`FloatableRaising`](/mojo/stdlib/builtin/floatable/FloatableRaising): The `FloatableRaising` trait describes a type that can be converted to a Float64, but the conversion might raise an error (e.g.: a string). --- ## Floatable The `Floatable` trait describes a type that can be converted to a Float64. This trait requires the type to implement the `__float__()` method. 
For example:

```mojo
struct Foo(Floatable):
    var i: Float64

    fn __float__(self) -> Float64:
        return self.i
```

A `Foo` can now be converted to a `Float64`:

```mojo
var f = Float64(Foo(5.5))
```

**Note:** If the `__float__()` method can raise an error, use the [`FloatableRaising`](/mojo/stdlib/builtin/floatable/floatableraising) trait instead. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__float__` `__float__(self: _Self) -> SIMD[float64, 1]` Get the floating point representation of the value. **Returns:** The floating point representation of the value. --- ## FloatableRaising The `FloatableRaising` trait describes a type that can be converted to a Float64, but the conversion might raise an error (e.g.: a string). This trait requires the type to implement the `__float__()` method, which can raise an error. For example:

```mojo
from utils import Variant

struct MaybeFloat(FloatableRaising):
    var value: Variant[Float64, NoneType]

    fn __float__(self) raises -> Float64:
        if self.value.isa[NoneType]():
            raise "Float expected"
        return self.value[Float64]
```

A `MaybeFloat` can now be converted to `Float64`:

```mojo
try:
    print(Float64(MaybeFloat(4.6)))
except:
    print("error occurred")
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__float__` `__float__(self: _Self) -> SIMD[float64, 1]` Get the floating point representation of the value. **Returns:** The floating point representation of the value. **Raises:** If the type does not have a floating point representation. --- ## FloatLiteral `@register_passable(trivial)` `struct FloatLiteral[value: !pop.float_literal]` Mojo floating point literal type. ## Parameters * value (`!pop.float_literal`): The underlying infinite precision floating point value. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `Floatable`, `ImplicitlyBoolable`, `Intable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `infinity` `alias infinity = inf` ### `nan` `alias nan` ### `negative_infinity` `alias negative_infinity = -inf` ### `negative_zero` `alias negative_zero = -0.0` ## Methods ### `__init__` `__init__() -> Self` Create a FloatLiteral for any parameter value. `@implicit` `__init__(value: IntLiteral[value]) -> FloatLiteral[#pop.int_to_float_literal]` Convert an IntLiteral to a FloatLiteral value. **Args:** * value (`IntLiteral[value]`): The IntLiteral value. ### `__bool__` `__bool__(self) -> Bool` A FloatLiteral value is true if it is non-zero. **Returns:** True if non-zero. ### `__neg__` `__neg__(self) -> FloatLiteral[#pop.float_literal_bin>]` Return the negation of the FloatLiteral value. **Returns:** The negated FloatLiteral value. ### `__lt__` `__lt__(self, rhs: FloatLiteral[value]) -> Bool` Less than comparison. **Args:** * rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is less than `rhs`. ### `__le__` `__le__(self, rhs: FloatLiteral[value]) -> Bool` Less than or equal to comparison. **Args:** * rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is less than or equal to `rhs`. ### `__eq__` `__eq__(self, rhs: FloatLiteral[value]) -> Bool` Compare for equality. **Args:** * rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: FloatLiteral[value]) -> Bool` Compare for inequality. **Args:** * rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if they are not equal. ### `__gt__` `__gt__(self, rhs: FloatLiteral[value]) -> Bool` Greater than comparison.
**Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is greater than `rhs`. ### `__ge__` `__ge__(self, rhs: FloatLiteral[value]) -> Bool` Greater than or equal to comparison. **Args:** * ​rhs (`FloatLiteral[value]`): The value to compare. **Returns:** True if this value is greater than or equal to `rhs`. ### `__add__` `__add__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Add two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to add. **Returns:** The sum of the two values. ### `__sub__` `__sub__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Subtract two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to subtract. **Returns:** The difference of the two values. ### `__mul__` `__mul__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Multiply two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to multiply. **Returns:** The product of the two values. ### `__truediv__` `__truediv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Divide two FloatLiterals. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide. **Returns:** The quotient of the two values. ### `__floordiv__` `__floordiv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Returns self divided by rhs, rounded down to the nearest integer. **Args:** * ​rhs (`FloatLiteral[value]`): The divisor value. **Returns:** `floor(self / rhs)` value. ### `__mod__` `__mod__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin, value>>]` Return the remainder of self divided by rhs. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__radd__` `__radd__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed addition operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to add. **Returns:** The sum of this and the given value. ### `__rsub__` `__rsub__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed subtraction operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to subtract from. **Returns:** The result of subtracting this from the given value. ### `__rmul__` `__rmul__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed multiplication operator. **Args:** * ​rhs (`FloatLiteral[value]`): The value to multiply. **Returns:** The product of the given number and this. ### `__rtruediv__` `__rtruediv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Reversed division. **Args:** * ​rhs (`FloatLiteral[value]`): The value to be divided by this. **Returns:** The result of dividing the given value by this. ### `__rfloordiv__` `__rfloordiv__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin]` Returns rhs divided by self, rounded down to the nearest integer. **Args:** * ​rhs (`FloatLiteral[value]`): The value to be divided by self. **Returns:** `floor(rhs / self)` value. ### `__rmod__` `__rmod__(self, rhs: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin, value>>]` Return the remainder of rhs divided by self. **Args:** * ​rhs (`FloatLiteral[value]`): The value to divide on. **Returns:** The remainder of dividing rhs by self. ### `is_nan` `is_nan(self) -> Bool` Return whether the FloatLiteral is nan. Since `nan == nan` is False, this provides a way to check for nan-ness. 
**Returns:** True, if the value is nan, False otherwise. ### `is_neg_zero` `is_neg_zero(self) -> Bool` Return whether the FloatLiteral is negative zero. Since `FloatLiteral.negative_zero == 0.0` is True, this provides a way to check if the FloatLiteral is negative zero. **Returns:** True, if the value is negative zero, False otherwise. ### `__str__` `__str__(self) -> String` Get the float as a string. **Returns:** A string representation. ### `__int_literal__` `__int_literal__(self) -> IntLiteral[#pop.float_to_int_literal]` Casts the floating point value to an IntLiteral. If there is a fractional component, then the value is truncated towards zero. E.g. `(4.5).__int_literal__()` returns `4`, and `(-3.7).__int_literal__()` returns `-3`. **Returns:** The value as an integer. ### `__int__` `__int__(self) -> Int` Converts the FloatLiteral value to an Int. If there is a fractional component, then the value is truncated towards zero. E.g. `(4.5).__int__()` returns `4`, and `(-3.7).__int__()` returns `-3`. **Returns:** The value as an integer. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Converts the FloatLiteral to a concrete Float64. **Returns:** The Float value. ### `__as_bool__` `__as_bool__(self) -> Bool` A FloatLiteral value is true if it is non-zero. **Returns:** True if non-zero. ### `__ceildiv__` `__ceildiv__(self, denominator: FloatLiteral[value]) -> FloatLiteral[#pop.float_literal_bin>>, #pop.float_literal>]` Return the rounded-up result of dividing self by denominator. **Args:** * denominator (`FloatLiteral[value]`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. --- ## floor `floor[T: Floorable, //](value: T) -> T` Get the floor value of the given object. **Parameters:** * T (`Floorable`): The type conforming to `Floorable`. **Args:** * value (`T`): The object to get the floor value of. **Returns:** The floor value of the object. --- ## Floorable The `Floorable` trait describes a type that defines a floor operation. Types that conform to `Floorable` will work with the builtin `floor` function. The floor operation always returns the same type as the input. For example:

```mojo
from math import Floorable, floor

@value
struct Complex(Floorable):
    var re: Float64
    var im: Float64

    fn __floor__(self) -> Self:
        return Self(floor(self.re), floor(self.im))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__floor__` `__floor__(self: _Self) -> _Self` Return the floor of the value. **Returns:** The floor value. --- ## FlushDenormals `struct FlushDenormals` Denormals are flushed to zero within the context, and the state is restored to the prior value on exit. ## Fields * state (`SIMD[int32, 1]`): The current state. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes the FlushDenormals. ### `__enter__` `__enter__(self)` Enters the context. This will set denormals to zero. ### `__exit__` `__exit__(self)` Exits the context. This will restore the prior FPState. --- ## fma `fma[mode: StringSlice[StaticConstantOrigin], type: DType](z_row_index: Int, x_row_index: Int, y_row_index: Int, clear_z: Bool)` --- ## fma `fma(a: Int, b: Int, c: Int) -> Int` Performs `fma` (fused multiply-add) on the inputs. The result is `(a * b) + c`. **Args:** * a (`Int`): The first input. * b (`Int`): The second input. * c (`Int`): The third input. **Returns:** `(a * b) + c`.
`fma(a: UInt, b: UInt, c: UInt) -> UInt` Performs `fma` (fused multiply-add) on the inputs. The result is `(a * b) + c`. **Args:** * a (`UInt`): The first input. * b (`UInt`): The second input. * c (`UInt`): The third input. **Returns:** `(a * b) + c`. `fma[dtype: DType, width: Int, //](a: SIMD[dtype, width], b: SIMD[dtype, width], c: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise `fma` (fused multiply-add) on the inputs. Each element in the result SIMD vector is $(A_i * B_i) + C_i$, where $A_i$, $B_i$ and $C_i$ are elements at index $i$ in a, b, and c respectively. **Parameters:** * dtype (`DType`): The `dtype` of the input SIMD vector. * width (`Int`): The width of the input and output SIMD vector. **Args:** * a (`SIMD[dtype, width]`): The first vector of inputs. * b (`SIMD[dtype, width]`): The second vector of inputs. * c (`SIMD[dtype, width]`): The third vector of inputs. **Returns:** Elementwise `fma` of a, b and c. --- ## fma16 `fma16(gpr: Int)` Float16 matrix multiply and add. --- ## fma32 `fma32(gpr: Int)` Float32 matrix multiply and add. --- ## fma64 `fma64(gpr: Int)` Float64 matrix multiply and add. --- ## fms16 `fms16(gpr: Int)` Float16 matrix multiply and subtract. --- ## fold `fold[dtype: DType, input_dim: DimList, output_dim: DimList, target: StringSlice[StaticConstantOrigin]](input: NDBuffer[dtype, 3, MutableAnyOrigin, input_dim], output: NDBuffer[dtype, 4, MutableAnyOrigin, output_dim], output_size: IndexList[2], kernel_size: IndexList[2], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2], ctx: DeviceContextPtr)` Folds an array of sliding local blocks into a single output tensor. **Args:** * input (`NDBuffer[dtype, 3, MutableAnyOrigin, input_dim]`): Input tensor to fold, shape \[N, C x kernel size, num\_blocks]. * output (`NDBuffer[dtype, 4, MutableAnyOrigin, output_dim]`): Output tensor to write to, shape \[N, C, H, W]. * output\_size (`IndexList[2]`): Spatial shape of the output tensor (H, W). * kernel\_size (`IndexList[2]`): Size of the sliding blocks. * stride (`IndexList[2]`): Stride of the sliding blocks. * dilation (`IndexList[2]`): Dilation of the sliding blocks. * padding (`IndexList[2]`): 0-paddings to be added on both sides of the inputs. * ctx (`DeviceContextPtr`): DeviceContextPtr. --- ## fold Implements the fold operation. ## Functions * [​`fold`](./fold): Folds an array of sliding local blocks into a single output tensor. * [​`fold_shape`](./fold_shape): Returns the shape of the output tensor of the fold operation. --- ## fold_shape `fold_shape[dtype: DType, input_dim: DimList](input: NDBuffer[dtype, 3, MutableAnyOrigin, input_dim], output_size: IndexList[2], kernel_size: IndexList[2], stride: IndexList[2], dilation: IndexList[2], padding: IndexList[2]) -> IndexList[4]` Returns the shape of the output tensor of the fold operation. --- ## foreach `foreach[type: DType, rank: Int, //, func: fn[Int](IndexList[rank]) capturing -> SIMD[type, $0], *, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), simd_width: Int = get_kernel_simd_width[::DType,::StringSlice[::Bool(), _synchronous: Bool = False, _trace_name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("mogg.for_each")](tensor: ManagedTensorSlice[io_spec, static_spec=static_spec], ctx: DeviceContextPtr = DeviceContextPtr())` Apply the function `func` to each element of the tensor slice. **Parameters:** * type (`DType`): The data type of the elements in the tensor slice.
* rank (`Int`): The rank of the tensor slice. * func (`fn[Int](IndexList[rank]) capturing -> SIMD[type, $0]`): The function to apply to each element of the tensor slice. * target (`StringSlice[StaticConstantOrigin]`): Indicates the type of the target device (e.g. "cpu", "gpu"). * simd\_width (`Int`): The SIMD width for the target (usually leave this as its default value). * \_synchronous (`Bool`): True to run the custom op synchronously in the runtime (defaults to False). * \_trace\_name (`StringSlice[StaticConstantOrigin]`): Name of the executed operation displayed in the trace\_description. **Args:** * tensor (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The output tensor slice which receives the return values from `func`. * ctx (`DeviceContextPtr`): The call context (forward this from the custom operation). `foreach[: origin.set, type: DType, rank: Int, //, func: fn[Int](IndexList[rank]) capturing -> SIMD[type, $0], out_func: fn[Int](IndexList[rank]) capturing -> None, *, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), simd_width: Int = get_kernel_simd_width[::DType,::StringSlice[::Bool(), _synchronous: Bool = False, _trace_name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("mogg.for_each")](tensor: ManagedTensorSlice[io_spec, static_spec=static_spec], ctx: DeviceContextPtr = DeviceContextPtr())` Apply the function `func` to each element of the tensor slice. **Parameters:** * type (`DType`): The data type of the elements in the tensor slice. * rank (`Int`): The rank of the tensor slice. * func (`fn[Int](IndexList[rank]) capturing -> SIMD[type, $0]`): The function to apply to each element of the tensor slice. * out\_func (`fn[Int](IndexList[rank]) capturing -> None`): The function to apply on each output element. * target (`StringSlice[StaticConstantOrigin]`): Indicates the type of the target device (e.g. "cpu", "gpu"). * simd\_width (`Int`): The SIMD width for the target (usually leave this as its default value). * \_synchronous (`Bool`): True to run the custom op synchronously in the runtime (defaults to False). * \_trace\_name (`StringSlice[StaticConstantOrigin]`): Name of the executed operation displayed in the trace\_description. **Args:** * tensor (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The input tensor slice whose values are consumed. * ctx (`DeviceContextPtr`): The call context (forward this from the custom operation). --- ## form_q `form_q[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], Q: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Forms the Q factor from the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` and stores the result in `Q`. --- ## format String formatting utilities for Mojo. This module provides string formatting functionality similar to Python's `str.format()` method.
The `format()` method (available on the [`String`](/mojo/stdlib/collections/string/string/String#format) and [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice#format) types) takes the current string as a template (or "format string"), which can contain literal text and/or replacement fields delimited by curly braces (`{}`). The replacement fields are replaced with the values of the arguments.

Replacement fields can be mapped to the arguments in one of two ways:

* Automatic indexing by argument position:

  ```mojo
  var s = String("{} is {}").format("Mojo", "🔥")
  ```

* Manual indexing by argument position:

  ```mojo
  var s = String("{1} is {0}").format("hot", "🔥")
  ```

The replacement fields can also contain the `!r` or `!s` conversion flags, to indicate whether the argument should be formatted using `repr()` or `String()`, respectively:

```mojo
var s = String("{!r}").format(myComplicatedObject)
```

Note that the following features from Python's `str.format()` are **not yet supported**:

* Named arguments (for example, `"{name} is {adjective}"`).
* Accessing the attributes of an argument value (for example, `"{0.name}"`).
* Accessing an indexed value from the argument (for example, `"{1[0]}"`).
* Format specifiers for controlling output format (width, precision, and so on).

Example:

```mojo
from collections.string import String

# Basic formatting
var s1 = String("Hello {0}!").format("World")  # Hello World!

# Multiple arguments
var s2 = String("{0} plus {1} equals {2}").format(1, 2, 3)  # 1 plus 2 equals 3

# Conversion flags
var s3 = String("{!r}").format("test")  # "'test'"
```

This module has no public API; its functionality is available through the [`String.format()`](/mojo/stdlib/collections/string/string/String#format) and [`StringSlice.format()`](/mojo/stdlib/collections/string/string_slice/StringSlice#format) methods.

---

## Format

`struct Format`

Defines a format for the benchmark output when printing or writing to a file.

## Fields

* ​value (`StringSlice[StaticConstantOrigin]`): The format to print results.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `csv`

`alias csv = __init__[__mlir_type.!kgen.string]("csv")`

Comma separated values with no alignment.

### `table`

`alias table = __init__[__mlir_type.!kgen.string]("table")`

Table format with dynamically aligned columns.

### `tabular`

`alias tabular = __init__[__mlir_type.!kgen.string]("tabular")`

Comma separated values with dynamically aligned columns.

## Methods

### `__init__`

`@implicit`

`__init__(out self, value: StringSlice[origin])`

Constructs a Format object from a string.

**Args:**

* ​value (`StringSlice[origin]`): The format to print results.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Checks if two Format objects are equal.

**Args:**

* ​other (`Self`): The `Format` to compare with.

**Returns:**

True if the two `Format` objects are equal, false otherwise.

### `__str__`

`__str__(self) -> String`

Returns the string representation of the format.

**Returns:**

The string representation of the format.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes the format to a writer.

**Parameters:**

* ​W (`Writer`): A type conforming to the `Writer` trait.

**Args:**

* ​writer (`W`): The writer to write the `Format` to.

---

## format_int

Provides the `hex`, `bin`, and `oct` functions. These are Mojo built-ins, so you don't need to import them.
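For example, a minimal sketch of these built-ins in action (the printed values are shown in the comments):

```mojo
def main():
    print(hex(255))  # 0xff
    print(bin(5))    # 0b101
    print(oct(64))   # 0o100
```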
## Functions

* [​`bin`](/mojo/stdlib/builtin/format_int/bin): Return the binary string representation of an integral value.
* [​`hex`](/mojo/stdlib/builtin/format_int/hex): Returns the hex string representation of the given integer.
* [​`oct`](/mojo/stdlib/builtin/format_int/oct): Returns the octal string representation of the given integer.

---

## format_layout

`format_layout[W: Writer](layout: Layout, mut writer: W)`

Formats a 2D layout as a table and writes it to the specified writer.

This function creates a visual representation of a 2D layout as a table showing the memory indices for each logical coordinate.

**Parameters:**

* ​W (`Writer`): Type parameter representing a Writer implementation.

**Args:**

* ​layout (`Layout`): The 2D layout to format.
* ​writer (`W`): The writer to output the formatted layout to.

---

## fp8_quantization

## Functions

* [​`block_reduce`](./block_reduce):
* [​`matmul_dynamic_scaled_fp8`](./matmul_dynamic_scaled_fp8):
* [​`quantize_dynamic_scaled_fp8`](./quantize_dynamic_scaled_fp8):
* [​`quantize_fp8_kernel`](./quantize_fp8_kernel):
* [​`quantize_static_scaled_fp8`](./quantize_static_scaled_fp8):

---

## FPUtils

`struct FPUtils[dtype: DType, *, _constraint: NoneType = NoneType(_constrain_fp_type[::DType]())]`

Collection of utility functions for working with FP values.

**Constraints:**

The dtype is floating point.

## Parameters

* ​dtype (`DType`): The concrete FP dtype (FP32/FP64/etc).
* ​\_constraint (`NoneType`): Implements the constraint. Do not pass explicitly.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Aliases

### `integral_type`

`alias integral_type = _integral_type_of[::DType]()`

The equivalent integer dtype of the float type.

### `uint_type`

`alias uint_type = _unsigned_integral_type_of[::DType]()`

The equivalent uint dtype of the float type.

## Methods

### `mantissa_width`

`static mantissa_width() -> Int`

Returns the mantissa width of a floating point type.

**Returns:**

The mantissa width.

### `max_exponent`

`static max_exponent() -> Int`

Returns the max exponent of a floating point dtype without accounting for inf representations. This is not the maximum representable exponent, which is generally equal to the exponent\_bias.

**Returns:**

The max exponent.

### `exponent_width`

`static exponent_width() -> Int`

Returns the exponent width of a floating point type.

**Returns:**

The exponent width.

### `mantissa_mask`

`static mantissa_mask() -> Int`

Returns the mantissa mask of a floating point type.

**Returns:**

The mantissa mask.

### `exponent_bias`

`static exponent_bias() -> Int`

Returns the exponent bias of a floating point type.

**Returns:**

The exponent bias.

### `sign_mask`

`static sign_mask() -> Int`

Returns the sign mask of a floating point type. It is computed by `1 << (exponent_width + mantissa_width)`.

**Returns:**

The sign mask.

### `exponent_mask`

`static exponent_mask() -> Int`

Returns the exponent mask of a floating point type. It is computed by `~(sign_mask | mantissa_mask)`.

**Returns:**

The exponent mask.

### `exponent_mantissa_mask`

`static exponent_mantissa_mask() -> Int`

Returns the exponent and mantissa mask of a floating point type. It is computed by `exponent_mask | mantissa_mask`.

**Returns:**

The exponent and mantissa mask.

### `quiet_nan_mask`

`static quiet_nan_mask() -> Int`

Returns the quiet NaN mask for a floating point type. The mask sets all exponent bits and the most significant mantissa bit: `(((1 << exponent_width) - 1) << mantissa_width) | (1 << (mantissa_width - 1))`.

**Returns:**

The quiet NaN mask.

### `bitcast_to_integer`

`static bitcast_to_integer(value: SIMD[dtype, 1]) -> Int`

Bitcasts the floating-point value to an integer.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.
**Returns:**

An integer representation of the floating-point value.

### `bitcast_to_uint`

`static bitcast_to_uint(value: SIMD[dtype, 1]) -> SIMD[_unsigned_integral_type_of[::DType](), 1]`

Bitcasts the floating-point value to an unsigned integer.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.

**Returns:**

An unsigned integer representation of the floating-point value.

### `bitcast_from_integer`

`static bitcast_from_integer(value: Int) -> SIMD[dtype, 1]`

Bitcasts the floating-point value from an integer.

**Args:**

* ​value (`Int`): The int value.

**Returns:**

A floating-point representation of the Int.

### `get_sign`

`static get_sign(value: SIMD[dtype, 1]) -> Bool`

Returns the sign of the floating point value.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.

**Returns:**

Returns True if the sign is set and False otherwise.

### `set_sign`

`static set_sign(value: SIMD[dtype, 1], sign: Bool) -> SIMD[dtype, 1]`

Sets the sign of the floating point value.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.
* ​sign (`Bool`): True to set the sign and false otherwise.

**Returns:**

Returns the floating point value with the sign set.

### `get_exponent`

`static get_exponent(value: SIMD[dtype, 1]) -> Int`

Returns the exponent bits of the floating-point value.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.

**Returns:**

Returns the exponent bits.

### `get_exponent_biased`

`static get_exponent_biased(value: SIMD[dtype, 1]) -> Int`

Returns the biased exponent of the floating-point value as an Int. This is how the value is stored before subtracting the exponent bias.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.

**Returns:**

The biased exponent as an Int.

### `set_exponent`

`static set_exponent(value: SIMD[dtype, 1], exponent: Int) -> SIMD[dtype, 1]`

Sets the exponent bits of the floating-point value.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.
* ​exponent (`Int`): The exponent bits.

**Returns:**

Returns the floating-point value with the exponent bits set.

### `get_mantissa`

`static get_mantissa(value: SIMD[dtype, 1]) -> Int`

Gets the mantissa bits of the floating-point value.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.

**Returns:**

The mantissa bits.

### `get_mantissa_uint`

`static get_mantissa_uint(value: SIMD[dtype, 1]) -> SIMD[_unsigned_integral_type_of[::DType](), 1]`

Gets the mantissa bits of the floating-point value.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.

**Returns:**

The mantissa bits.

### `set_mantissa`

`static set_mantissa(value: SIMD[dtype, 1], mantissa: Int) -> SIMD[dtype, 1]`

Sets the mantissa bits of the floating-point value.

**Args:**

* ​value (`SIMD[dtype, 1]`): The floating-point value.
* ​mantissa (`Int`): The mantissa bits.

**Returns:**

Returns the floating-point value with the mantissa bits set.

### `pack`

`static pack(sign: Bool, exponent: Int, mantissa: Int) -> SIMD[dtype, 1]`

Construct a floating-point value from its constituent sign, exponent, and mantissa.

**Args:**

* ​sign (`Bool`): The sign of the floating-point value.
* ​exponent (`Int`): The exponent of the floating-point value.
* ​mantissa (`Int`): The mantissa of the floating-point value.

**Returns:**

Returns the floating-point value.

---

## frexp

`frexp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> StaticTuple[SIMD[dtype, width], 2]`

Breaks floating point values into a fractional part and an exponent part.
This follows C and Python in increasing the exponent by 1 and normalizing the fraction to the range [0.5, 1.0) instead of [1.0, 2.0).

**Constraints:**

The input must be a floating-point type.

**Parameters:**

* ​dtype (`DType`): The `dtype` of the input and output SIMD vector.
* ​width (`Int`): The width of the input and output SIMD vector.

**Args:**

* ​x (`SIMD[dtype, width]`): The input values.

**Returns:**

A tuple of two SIMD vectors containing the fractional and exponent parts of the input floating point values.

---

## fsm32

`fsm32(gpr: Int)`

Float32 matrix multiply and subtract.

---

## fsm64

`fsm64(gpr: Int)`

Float64 matrix multiply and subtract.

---

## fstat

Implements file system status operations.

You can import these APIs from the `os` package. For example:

```mojo
from os import stat
```

## Structs

* [​`stat_result`](/mojo/stdlib/os/fstat/stat_result): Object whose fields correspond to the members of the stat structure.

## Functions

* [​`lstat`](/mojo/stdlib/os/fstat/lstat): Get the status of a file or a file descriptor (similar to stat, but does not follow symlinks).
* [​`stat`](/mojo/stdlib/os/fstat/stat): Get the status of a file or a file descriptor.

---

## func_attribute

GPU Kernel Function Attributes Module

This module provides structures for defining and managing GPU kernel function attributes. It implements functionality similar to CUDA's CUfunction\_attribute enum, allowing for querying and setting various attributes that control kernel execution behavior and resource allocation.

The module includes:

* `Attribute`: A value type representing different GPU kernel function attribute types
* `FuncAttribute`: A structure that pairs an attribute type with its value

These structures enable fine-grained control over GPU kernel execution parameters such as shared memory allocation, cache behavior, and cluster configuration.

## Structs

* [​`Attribute`](/mojo/stdlib/gpu/host/func_attribute/Attribute): Represents GPU kernel function attributes.
* [​`FuncAttribute`](/mojo/stdlib/gpu/host/func_attribute/FuncAttribute): Implements CUDA's CUfunction\_attribute enum for GPU kernel function attributes.

---

## FuncAttribute

`@register_passable(trivial)`

`struct FuncAttribute`

Implements CUDA's CUfunction\_attribute enum for GPU kernel function attributes.

This struct represents function attributes that can be set or queried for GPU kernels, following NVIDIA's CUDA driver API conventions. Each attribute consists of a type (represented by the Attribute enum) and an associated value. The struct provides factory methods for creating common attribute configurations, such as cache mode settings and shared memory allocations.

## Fields

* ​attribute (`Attribute`): The type of function attribute.
* ​value (`SIMD[int32, 1]`): The value associated with this attribute.

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility`

## Aliases

### `NULL`

`alias NULL = FuncAttribute(Attribute(__init__[__mlir_type.!pop.int_literal](-1)), __init__[__mlir_type.!pop.int_literal](-1))`

A null/invalid function attribute constant.

## Methods

### `__init__`

`__init__(*, other: Self) -> Self`

Explicitly construct a deep copy of the provided value.

**Args:**

* ​other (`Self`): The value to copy.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Checks if two `FuncAttribute` instances are equal.

**Args:**

* ​other (`Self`): The FuncAttribute to compare with.

**Returns:**

True if both the attribute type and value are equal, False otherwise.
### `__ne__`

`__ne__(self, other: Self) -> Bool`

Checks if two `FuncAttribute` instances are not equal.

**Args:**

* ​other (`Self`): The `FuncAttribute` to compare with.

**Returns:**

True if either the attribute type or value differs, False otherwise.

### `CACHE_MODE_CA`

`static CACHE_MODE_CA(val: Bool) -> Self`

Creates a CACHE\_MODE\_CA function attribute. Indicates whether the function has been compiled with user specified option `CacheMode.L1_CACHE_DISABLED` set.

**Args:**

* ​val (`Bool`): Boolean value indicating if L1 cache is disabled.

**Returns:**

A `FuncAttribute` instance with CACHE\_MODE\_CA attribute type.

### `MAX_DYNAMIC_SHARED_SIZE_BYTES`

`static MAX_DYNAMIC_SHARED_SIZE_BYTES(val: SIMD[uint32, 1]) -> Self`

Creates a MAX\_DYNAMIC\_SHARED\_SIZE\_BYTES function attribute. The maximum size in bytes of dynamically-allocated shared memory that can be used by this function. If the user-specified dynamic shared memory size is larger than this value, the launch will fail.

**Args:**

* ​val (`SIMD[uint32, 1]`): Maximum dynamic shared memory size in bytes.

**Returns:**

A `FuncAttribute` instance with `MAX_DYNAMIC_SHARED_SIZE_BYTES` attribute type.

### `PREFERRED_SHARED_MEMORY_CARVEOUT`

`static PREFERRED_SHARED_MEMORY_CARVEOUT(val: SIMD[int32, 1]) -> Self`

Creates a PREFERRED\_SHARED\_MEMORY\_CARVEOUT function attribute. On devices where the L1 cache and shared memory use the same hardware resources, this sets the shared memory carveout preference, in percent of the total shared memory.

**Args:**

* ​val (`SIMD[int32, 1]`): Shared memory carveout preference as a percentage (0-100).

**Returns:**

A FuncAttribute instance with `PREFERRED_SHARED_MEMORY_CARVEOUT` attribute type.

---

## Function calling and tool use

Function calling enables AI models to dynamically interact with external systems, retrieve up-to-date data, and execute tasks. This capability is a foundational building block for agentic GenAI applications, where models call different functions to achieve specific objectives.

## When to use function calling

You may want to define functions for the following purposes:

- **To fetch data**: Access APIs, knowledge bases, or external services to retrieve up-to-date information and augment model responses
- **To perform actions**: Execute predefined tasks like modifying application states, invoking workflows, or integrating with custom business logic

Based on the system prompt and messages, the model may decide to call these functions instead of or in addition to generating text. Developers then handle the function calls, execute them, and return the results to the model, which integrates the function call results into its final response.

## How function calling works

MAX supports the [OpenAI function calling specification](https://platform.openai.com/docs/guides/function-calling): you register developer-defined functions as tools that the model can use to augment prompts, giving you more control over model behavior and a way to trigger actions directly from user input.

The following example defines a function, registers that function as a tool, and sends a request to the chat completion client.

:::note MAX does not currently support streaming with function calling. Be sure to set `stream` to `False` when making requests with function calling.
:::

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="")

# Define a function that the model can call
def get_weather(location: str):
    return f"Getting the weather for {location} ..."

# Register your function as an available tool
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g., 'Los Angeles, CA'"
                }
            },
            "required": [
                "location"
            ]
        }
    }
}]

# Generate a response with the chat completion client with access to tools
response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "What's the weather like in Paris today?"}],
    tools=tools,
    stream=False
)

# Print the model's selected function call
print(response.choices[0].message.tool_calls)
```

At this stage of the function calling workflow, the model responds with the selected tool to use along with detected function inputs:

```json
[{
    "id": "call_12345xyz",
    "type": "function",
    "function": {
        "name": "get_weather",
        "arguments": "{\"location\":\"Paris, France\"}"
    }
}]
```

From here, you must execute the function call and supply the model with the results in order to augment the model response.

The OpenAI function calling spec is compatible with multiple agent frameworks, such as [AutoGen](https://github.com/microsoft/autogen), [CrewAI](https://github.com/crewAIInc/crewAI), and more.

### Supported models

The `max` CLI supports several LLMs optimized for function calling:

- [`modularai/Llama-3.1-8B-Instruct-GGUF`](https://huggingface.co/modularai/Llama-3.1-8B-Instruct-GGUF)
- [Meta's Llama 3.1 models & evals](https://huggingface.co/collections/meta-llama/metas-llama-31-models-and-evals-675bfd70e574a62dd0e40565) collection
- [Meta's Llama 3.2 language models & evals](https://huggingface.co/collections/meta-llama/metas-llama-32-language-models-and-evals-675bfd70e574a62dd0e40586) collection

:::note The Meta Llama 3 models are hosted in gated repositories on Hugging Face. You must have a Hugging Face account with access to these repositories and an access token configured in your environment to deploy these models. :::

## Quickstart

Use MAX to serve a model that is compatible with function calling and test it out locally.

:::note Function calling is enabled by default with MAX. However, function calling with MAX is model-dependent and will only produce valid output if the model is pretrained to return tool use responses. This example uses the Modular implementation of Llama 3.1. For more information on which models to use, see [Supported models](/max/serve/function-calling#supported-models). :::

1. Follow the steps to [set up your project](/max/get-started#set-up-your-project) to set up a GenAI endpoint.
2.
Next, open a new window and send a request to the endpoint specifying the available tools:

```bash
curl -N http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "stream": false,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather like in Boston today?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. Los Angeles, CA"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ],
    "tool_choice": "auto"
  }'
```

In the generated response, you should see that the model selected the `get_weather` function as a tool call, with the function's inputs extracted from the original prompt:

```json
"tool_calls": [
    {
        "id": "call_ac73df14fe184349",
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": "{\"location\": \"Boston, MA\"}"
        }
    }
]
```

## Next steps

Now that you know the basics of function calling, you can get started with MAX on GPUs.

---

## functional

Implements higher-order functions.

You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import map
```

## Aliases

### `BinaryTile1DTileUnitFunc`

`alias BinaryTile1DTileUnitFunc = fn[Int](Int, Int) capturing -> None`

Signature of a tiled function that performs some work with a dynamic tile size and a secondary static tile size.

### `Dynamic1DTileUnitFunc`

`alias Dynamic1DTileUnitFunc = fn(Int, Int) capturing -> None`

Signature of a 1d tiled function that performs some work with a dynamic tile size and an offset. i.e. `func(offset: Int, tile_size: Int)`

### `Dynamic1DTileUnswitchUnitFunc`

`alias Dynamic1DTileUnswitchUnitFunc = fn[Bool](Int, Int, Int) capturing -> None`

### `Static1DTileUnitFunc`

`alias Static1DTileUnitFunc = fn[Int](Int) capturing -> None`

Signature of a 1d tiled function that performs some work with a static tile size and an offset. i.e. `func[tile_size](offset: Int)`

### `Static1DTileUnitFuncWithFlag`

`alias Static1DTileUnitFuncWithFlag = fn[Int, Bool](Int) capturing -> None`

### `Static1DTileUnitFuncWithFlags`

`alias Static1DTileUnitFuncWithFlags = fn[Int, Bool, Bool](Int) capturing -> None`

### `Static1DTileUnswitchUnitFunc`

`alias Static1DTileUnswitchUnitFunc = fn[Int, Bool](Int, Int) capturing -> None`

Signature of a tiled function that performs some work with a static tile size and an offset. i.e. `func[tile_size](offset: Int)`

### `Static2DTileUnitFunc`

`alias Static2DTileUnitFunc = fn[Int, Int](Int, Int) capturing -> None`

Signature of a 2d tiled function that performs some work with a static tile size and an offset. i.e.
`func[tile_size_x, tile_size_y](offset_x: Int, offset_y: Int)`

### `stencil`

`alias stencil = _stencil_impl_cpu[__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,::Int,::Int,::IndexList[$7, ::DType`

### `stencil_gpu`

`alias stencil_gpu = _stencil_impl_gpu[__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,__mlir_type.!lit.origin.set,::Int,::Int,::IndexList[$7, ::DType`

### `SwitchedFunction`

`alias SwitchedFunction = fn[Bool]() raises capturing -> None`

### `SwitchedFunction2`

`alias SwitchedFunction2 = fn[Bool, Bool]() capturing -> None`

## Functions

* [​`elementwise`](/mojo/stdlib/algorithm/functional/elementwise): Executes `func[width, rank](indices)`, possibly as sub-tasks, for a suitable combination of width and indices so as to cover shape. Returns when all sub-tasks have completed.
* [​`map`](/mojo/stdlib/algorithm/functional/map): Maps a function over a range from 0 to size.
* [​`parallelize`](/mojo/stdlib/algorithm/functional/parallelize): Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete.
* [​`parallelize_over_rows`](/mojo/stdlib/algorithm/functional/parallelize_over_rows): Parallelize func over non-axis dims of shape.
* [​`sync_parallelize`](/mojo/stdlib/algorithm/functional/sync_parallelize): Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete.
* [​`tile`](/mojo/stdlib/algorithm/functional/tile): A generator that launches work groups in the specified list of tile sizes.
* [​`tile_and_unswitch`](/mojo/stdlib/algorithm/functional/tile_and_unswitch): Performs tile and unswitch functional transformation.
* [​`tile_middle_unswitch_boundaries`](/mojo/stdlib/algorithm/functional/tile_middle_unswitch_boundaries): Divides 1d iteration space into three parts and tiles them with different steps.
* [​`unswitch`](/mojo/stdlib/algorithm/functional/unswitch): Performs a functional unswitch transformation.
* [​`vectorize`](/mojo/stdlib/algorithm/functional/vectorize): Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in separate iterations.

---

## Functions

As mentioned in the [syntax overview](/mojo/manual/basics), Mojo supports two keywords to declare functions: `def` and `fn`. You can use either declaration with any function, including the `main()` function, but they have different default behaviors, as described on this page.

We believe both `def` and `fn` have good use cases and don't consider either to be better than the other. Deciding which to use is a matter of personal taste as to which style best fits a given task.

:::note Functions declared inside a [`struct`](/mojo/manual/structs) are called "methods," but they have all the same qualities as "functions" described here. :::

## Anatomy of a function

Both `def` and `fn` function declarations have the same basic components (here demonstrated with a `def` function):

def function_name[ ​ parameters ... ]( ​ arguments ... ) -> return_value_type: ​ function_body

Functions can have:

- Parameters: A function can optionally take one or more compile-time _parameter_ values used for metaprogramming.
- Arguments: A function can also optionally take one or more run-time _arguments_.
- Return value: A function can optionally return a value.
- Function body: Statements that are executed when you call the function. Function definitions must include a body.

All of the optional parts of the function can be omitted, so the minimal function is something like this:

```mojo
def do_nothing():
    pass
```

If a function takes no parameters, you can omit the square brackets, but the parentheses are always required. Although you can't leave out the function body, you can use the `pass` statement to define a function that does nothing.

### Arguments and parameters

Functions take two kinds of inputs: _arguments_ and _parameters_. Arguments are familiar from many other languages: they are run-time values passed into the function.

```mojo
def add(a: Int, b: Int) -> Int:
    return a+b
```

On the other hand, you can think of a parameter as a compile-time variable that becomes a run-time constant. For example, consider the following function with a parameter:

```mojo
def add_tensors[rank: Int](a: MyTensor[rank], b: MyTensor[rank]) -> MyTensor[rank]:
    # ...
```

In this case, the `rank` value needs to be specified in a way that can be determined at compilation time, such as a literal or parameter expression.

When you compile a program that uses this code, the compiler produces a unique version of the function for each unique `rank` value used in the program, with `rank` treated as a constant within each specialized version.

This usage of "parameter" is probably different from what you're used to from other languages, where "parameter" and "argument" are often used interchangeably. In Mojo, "parameter" and "parameter expression" refer to compile-time values, and "argument" and "expression" refer to run-time values.

By default, both arguments and parameters can be specified either by position or by keyword. These forms can also be mixed in the same function call.

```mojo
# positional
x = add(5, 7)  # Positionally, a=5 and b=7
# keyword
y = add(b=3, a=9)
# mixed
z = add(5, b=7)  # Positionally a=5, by keyword b=7
```

For more information on arguments, see [Function arguments](#function-arguments) on this page. For more information on parameters, see [Parameterization: compile-time metaprogramming](/mojo/manual/parameters/).

## `def` and `fn` comparison

Functions defined with `def` and `fn` have much in common. They both have the following requirements:

* You must declare the type of each function parameter and argument.
* If a function doesn't return a value, you can either omit the return type or declare `None` as the return type.

  ```mojo
  # The following function definitions are equivalent
  def greet(name: String):
      print("Hello,", name)

  def greet(name: String) -> None:
      print("Hello,", name)
  ```

* If the function returns a value, you must either declare the return type using the -> type syntax or provide a [named result](#named-results) in the argument list.

  ```mojo
  # The following function definitions are equivalent
  def incr(a: Int) -> Int:
      return a + 1

  def incr(a: Int, out b: Int):
      b = a + 1
  ```

  For more information, see the [Return values](#return-values) section of this page.

Where `def` and `fn` differ is error handling and argument mutability defaults.

* The compiler doesn't allow a function declared with `fn` to raise an error condition unless it explicitly includes a `raises` declaration. In contrast, the compiler assumes that *all* functions declared with `def` *might* raise an error. See the [Raising and non-raising functions](#raising-and-non-raising-functions) section of this page for more information.
* All arguments to a function declared with `fn` are immutable references by default (that is, values are read-only, using the `read` [argument convention](/mojo/manual/values/ownership#argument-conventions)). This prevents accidental mutations and permits the use of non-copyable types as arguments. All arguments to a function declared with `def` are mutable: they default to using the `read` [argument convention](/mojo/manual/values/ownership#argument-conventions) like an `fn` function, with a special addition: if the function mutates the argument, it makes a mutable copy.

You can override the default behavior for both `def` and `fn` functions by providing an explicit [argument convention](/mojo/manual/values/ownership#argument-conventions) when declaring the argument.

As far as a function caller is concerned, there is no difference between invoking a function declared with `def` vs a function declared with `fn`. You could reimplement a `def` function as an `fn` function without making any changes to code that calls the function.

## Function arguments

As noted in the previous section, there is a difference between how `def` and `fn` functions handle default *argument conventions*. Argument conventions are discussed in much more detail in the page on [Ownership](/mojo/manual/values/ownership#argument-conventions).

The remaining rules for arguments described in this section apply to both `def` and `fn` functions.

:::note Functions with `/` and `*` in the argument list

You might see the following characters in place of arguments: slash (`/`) and/or star (`*`). For example:

```mojo
def myfunc(pos_only, /, pos_or_keyword, *, keyword_only):
```

Arguments **before** the `/` can be passed only by position. Arguments **after** the `*` can be passed only by keyword. For details, see [Positional-only and keyword-only arguments](#positional-only-and-keyword-only-arguments).

You may also see argument names prefixed with one or two stars (`*`):

```mojo
def myfunc2(*names, **attributes):
```

An argument name prefixed by a single star character, like `*names`, identifies a [variadic argument](#variadic-arguments), while an argument name prefixed with a double star, like `**attributes`, identifies a [variadic keyword-only argument](#variadic-keyword-arguments). :::

### Optional arguments

An optional argument is one that includes a default value, such as the `exp` argument here:

```mojo
fn my_pow(base: Int, exp: Int = 2) -> Int:
    return base ** exp

fn use_defaults():
    # Uses the default value for `exp`
    var z = my_pow(3)
    print(z)
```

However, you can't define a default value for an argument that's declared with the [`mut`](/mojo/manual/values/ownership#mutable-arguments-mut) argument convention.

Any optional arguments must appear after any required arguments. [Keyword-only arguments](#positional-only-and-keyword-only-arguments), discussed later, can also be either required or optional.

### Keyword arguments

You can also use keyword arguments when calling a function. Keyword arguments are specified using the format argument_name = argument_value. You can pass keyword arguments in any order:

```mojo
fn my_pow(base: Int, exp: Int = 2) -> Int:
    return base ** exp

fn use_keywords():
    # Uses keyword argument names (with order reversed)
    var z = my_pow(exp=3, base=2)
    print(z)
```

### Variadic arguments

Variadic arguments let a function accept a variable number of arguments.
To define a function that takes a variadic argument, use the variadic argument syntax *argument_name: ```mojo fn sum(*values: Int) -> Int: var sum: Int = 0 for value in values: sum = sum + value return sum ``` The variadic argument `values` here is a placeholder that accepts any number of passed positional arguments. You can define zero or more arguments before the variadic argument. When calling the function, any remaining positional arguments are assigned to the variadic argument, so any arguments declared **after** the variadic argument can only be specified by keyword (see [Positional-only and keyword-only arguments](#positional-only-and-keyword-only-arguments)). Variadic arguments can be divided into two categories: * Homogeneous variadic arguments, where all of the passed arguments are the same type—all `Int`, or all `String`, for example. * Heterogeneous variadic arguments, which can accept a set of different argument types. The following sections describe how to work with homogeneous and heterogeneous variadic arguments. :::note Variadic parameters Mojo also supports variadic *parameters*, but with some limitations—for details see [variadic parameters](/mojo/manual/parameters/#variadic-parameters). ::: #### Homogeneous variadic arguments When defining a homogeneous variadic argument, use *argument_name: argument_type: ```mojo def greet(*names: String): ... ``` Inside the function body, the variadic argument is available as an iterable list for ease of use. Currently there are some differences in handling the list depending on whether the arguments are register-passable types (such as `Int`) or memory-only types (such as `String`). :::note TODO We hope to remove these differences in the future. ::: Register-passable types, such as `Int`, are available as a [`VariadicList`](/mojo/stdlib/builtin/list_literal/VariadicList) type. As shown in the previous example, you can iterate over the values using a `for..in` loop. ```mojo fn sum(*values: Int) -> Int: var sum: Int = 0 for value in values: sum = sum+value return sum ``` Memory-only types, such as `String`, are available as a [`VariadicListMem`](/mojo/stdlib/builtin/list_literal/VariadicListMem). Iterating over this list directly with a `for..in` loop currently produces a [`Pointer`](/mojo/stdlib/memory/pointer/Pointer) to each value instead of the value itself. You must add an empty subscript operator `[]` to dereference the pointer and retrieve the value: ```mojo def make_worldly(mut *strs: String): # Requires extra [] to dereference the pointer for now. for i in strs: i[] += " world" ``` Alternately, subscripting into a `VariadicListMem` returns the argument value, and doesn't require any dereferencing: ```mojo fn make_worldly(mut *strs: String): # This "just works" as you'd expect! for i in range(len(strs)): strs[i] += " world" ``` #### Heterogeneous variadic arguments Implementing heterogeneous variadic arguments is somewhat more complicated than homogeneous variadic arguments. Writing generic code to handle multiple argument types requires [traits](/mojo/manual/traits) and [parameters](/mojo/manual/parameters/). So the syntax may look a little unfamiliar if you haven't worked with those features. The signature for a function with a heterogeneous variadic argument looks like this: ```mojo def count_many_things[*ArgTypes: Intable](*args: *ArgTypes): ... 
```

The parameter list, `[*ArgTypes: Intable]`, specifies that the function takes an `ArgTypes` parameter, which is a list of types, all of which conform to the [`Intable`](/mojo/stdlib/builtin/int/Intable) trait.

The argument list, `(*args: *ArgTypes)`, has the familiar `*args` for the variadic argument, but instead of a single type, its type is defined as a *list* of types, `*ArgTypes`. This means that each argument in `args` has a corresponding type in `ArgTypes`, so `args[n]` is of type `ArgTypes[n]`.

Inside the function, `args` is available as a [`VariadicPack`](/mojo/stdlib/builtin/list_literal/VariadicPack). The easiest way to work with the arguments is to use the `each()` method to iterate through the `VariadicPack`:

```mojo
fn count_many_things[*ArgTypes: Intable](*args: *ArgTypes) -> Int:
    var total = 0

    @parameter
    fn add[Type: Intable](value: Type):
        total += Int(value)

    args.each[add]()
    return total

print(count_many_things(5, 11.7, 12))
```

```output
28
```

In the example above, the `add()` function is called for each argument in turn, with the appropriate `value` and `Type` values. For instance, `add()` is first called with `value=5` and `Type=Int`, then with `value=11.7` and `Type=Float64`.

Also, note that when calling `count_many_things()`, you don't actually pass in a list of argument types. You only need to pass in the arguments, and Mojo generates the `ArgTypes` list itself.

As a small optimization, if your function is likely to be called with a single argument frequently, you can define your function with a single argument followed by a variadic argument. This lets the simple case bypass populating and iterating through the `VariadicPack`. For example, given a `print_string()` function that prints a single string, you could re-implement the variadic `print()` function with code like this:

```mojo
fn print_string(s: String):
    print(s, end="")

fn print_many[T: Stringable, *Ts: Stringable](first: T, *rest: *Ts):
    print_string(String(first))

    @parameter
    fn print_elt[T: Stringable](a: T):
        print_string(" ")
        print_string(String(a))

    rest.each[print_elt]()

print_many("Bob")
```

```output
Bob
```

If you call `print_many()` with a single argument, it calls `print_string()` directly. The `VariadicPack` is empty, so `each()` returns immediately without calling the `print_elt()` function.

#### Variadic keyword arguments

Mojo functions also support variadic keyword arguments (`**kwargs`). Variadic keyword arguments allow the user to pass an arbitrary number of keyword arguments. To define a function that takes a variadic keyword argument, use the variadic keyword argument syntax **kw_argument_name:

```mojo
fn print_nicely(**kwargs: Int) raises:
    for key in kwargs.keys():
        print(key[], "=", kwargs[key[]])

# prints:
# `a = 7`
# `y = 8`
print_nicely(a=7, y=8)
```

In this example, the argument name `kwargs` is a placeholder that accepts any number of keyword arguments. Inside the body of the function, you can access the arguments as a dictionary of keywords and argument values (specifically, an instance of [`OwnedKwargsDict`](/mojo/stdlib/collections/dict/OwnedKwargsDict)).

There are currently a few limitations:

* Variadic keyword arguments are always implicitly treated as if they were declared with the `owned` [argument convention](/mojo/manual/values/ownership#argument-conventions), and can't be declared otherwise:

```mojo
# Not supported yet.
fn read_var_kwargs(read **kwargs: Int): ...
```

* All the variadic keyword arguments must have the same type, and this determines the type of the argument dictionary. For example, if the argument is `**kwargs: Float64` then the argument dictionary will be a `OwnedKwargsDict[Float64]`.

* The argument type must conform to both the [`Movable`](/mojo/stdlib/builtin/value/Movable) and [`Copyable`](/mojo/stdlib/builtin/value/Copyable) traits.

* Dictionary unpacking is not supported yet:

```mojo
fn takes_dict(d: Dict[String, Int]):
    print_nicely(**d)  # Not supported yet.
```

* Variadic keyword *parameters* are not supported yet:

```mojo
# Not supported yet.
fn var_kwparams[**kwparams: Int](): ...
```

### Positional-only and keyword-only arguments

When defining a function, you can restrict some arguments so that they can be passed only as positional arguments, or they can be passed only as keyword arguments.

To define positional-only arguments, add a slash character (`/`) to the argument list. Any arguments before the `/` are positional-only: they can't be passed as keyword arguments. For example:

```mojo
fn min(a: Int, b: Int, /) -> Int:
    return a if a < b else b
```

To define keyword-only arguments, add a star (`*`) to the argument list. Any arguments after the `*` can be passed only by keyword. For example:

```mojo
fn kw_only_args(a1: Int, a2: Int, *, double: Bool = True) -> Int:
    var product = a1 * a2
    if double:
        return product * 2
    else:
        return product
```

Keyword-only arguments often have default values, but this is not required. If a keyword-only argument doesn't have a default value, it is a *required keyword-only argument*. It must be specified, and it must be specified by keyword.

Any required keyword-only arguments must appear in the signature before any optional keyword-only arguments. That is, arguments appear in the following sequence in a function signature:

* Required positional arguments.
* Optional positional arguments.
* Variadic arguments.
* Required keyword-only arguments.
* Optional keyword-only arguments.
* Variadic keyword arguments.

For more information on keyword-only arguments, see [PEP 3102 – Keyword-Only Arguments](https://peps.python.org/pep-3102/).

## Overloaded functions

All function declarations must specify argument types, so if you want a function to work with different data types, you need to implement separate versions of the function that each specify different argument types. This is called "overloading" a function.

For example, here's an overloaded `add()` function that can accept either `Int` or `String` types:

```mojo
fn add(x: Int, y: Int) -> Int:
    return x + y

fn add(x: String, y: String) -> String:
    return x + y
```

If you pass anything other than `Int` or `String` to the `add()` function, you'll get a compiler error. That is, unless the value can be implicitly converted to `Int` or `String`. For example, `String` includes an overloaded version of its constructor (`__init__()`) that supports [implicit conversion](/mojo/manual/lifecycle/life#constructors-and-implicit-conversion) from a `StringLiteral` value. Thus, you can also pass a `StringLiteral` to a function that expects a `String`.

When resolving an overloaded function call, the Mojo compiler tries each candidate function and uses the one that works (if only one version works), or it picks the closest match (if it can determine a close match), or it reports that the call is ambiguous (if it can't figure out which one to pick). For details on how Mojo picks the best candidate, see [Overload resolution](#overload-resolution).

If the compiler can't figure out which function to use, you can resolve the ambiguity by explicitly casting your value to a supported argument type.
For example, the following code calls the overloaded `foo()` function, but both implementations accept an argument that supports [implicit conversion](/mojo/manual/lifecycle/life#constructors-and-implicit-conversion) from `StringLiteral`. So, the call to `foo(string)` is ambiguous and creates a compiler error. You can fix this by casting the value to the type you really want: ```mojo @value struct MyString: @implicit fn __init__(out self, string: StringLiteral): pass fn foo(name: String): print("String") fn foo(name: MyString): print("MyString") fn call_foo(): alias string: StringLiteral = "Hello" # foo(string) # error: ambiguous call to 'foo' ... This call is ambiguous because two `foo` functions match it foo(MyString(string)) ``` Overloading also works with combinations of both `fn` and `def` function declarations. ### Overload resolution When resolving an overloaded function, Mojo does not consider the return type or other contextual information at the call site—it considers only parameter and argument types and whether the functions are instance methods or static methods. The overload resolution logic filters for candidates according to the following rules, in order of precedence: 1. Candidates requiring the smallest number of implicit conversions (in both arguments and parameters). 2. Candidates without variadic arguments. 3. Candidates without variadic parameters. 4. Candidates with the shortest parameter signature. 5. Non-`@staticmethod` candidates (over `@staticmethod` ones, if available). If there is more than one candidate after applying these rules, the overload resolution fails. For example: ```mojo @register_passable("trivial") struct MyInt: """A type that is implicitly convertible to `Int`.""" var value: Int @implicit fn __init__(out self, _a: Int): self.value = _a fn foo[x: MyInt, a: Int](): print("foo[x: MyInt, a: Int]()") fn foo[x: MyInt, y: MyInt](): print("foo[x: MyInt, y: MyInt]()") fn bar[a: Int](b: Int): print("bar[a: Int](b: Int)") fn bar[a: Int](*b: Int): print("bar[a: Int](*b: Int)") fn bar[*a: Int](b: Int): print("bar[*a: Int](b: Int)") fn parameter_overloads[a: Int, b: Int, x: MyInt](): # `foo[x: MyInt, a: Int]()` is called because it requires no implicit # conversions, whereas `foo[x: MyInt, y: MyInt]()` requires one. foo[x, a]() # `bar[a: Int](b: Int)` is called because it does not have variadic # arguments or parameters. bar[a](b) # `bar[*a: Int](b: Int)` is called because it has variadic parameters. bar[a, a, a](b) parameter_overloads[1, 2, MyInt(3)]() struct MyStruct: fn __init__(out self): pass fn foo(mut self): print("calling instance method") @staticmethod fn foo(): print("calling static method") fn test_static_overload(): var a = MyStruct() # `foo(mut self)` takes precedence over a static method. a.foo() ``` ```output foo[x: MyInt, a: Int]() bar[a: Int](b: Int) bar[*a: Int](b: Int) ``` ## Return values Return value types are declared in the signature using the -> type syntax. Values are passed using the `return` keyword, which ends the function and returns the identified value (if any) to the caller. ```mojo def get_greeting() -> String: return "Hello" ``` By default, the value is returned to the caller as an owned value. As with arguments, a return value may be [implicitly converted](/mojo/manual/lifecycle/life#constructors-and-implicit-conversion) to the named return type. For example, the previous example calls `return` with a string literal, `"Hello"`, which is implicitly converted to a `String`. 
:::note Returning a reference A function can also return a mutable or immutable reference using a `ref` return value. For details, see [Lifetimes, origins, and references](/mojo/manual/values/lifetimes). :::

### Named results

Named function results allow a function to return a value that can't be moved or copied. Named result syntax lets you specify a named, uninitialized variable to return to the caller using the `out` argument convention:

```mojo
def get_name_tag(owned name: String, out name_tag: NameTag):
    name_tag = NameTag(name^)
```

The `out` argument convention identifies an uninitialized variable that the function must initialize. (This is the same as the `out` convention used in [struct constructors](/mojo/manual/lifecycle/life#constructor).) The `out` argument for a named result can appear anywhere in the argument list, but by convention, it should be the last argument in the list.

A function can declare only one return value, whether it's declared using an `out` argument or using the standard -> type syntax.

A function with a named result argument doesn't need to include an explicit `return` statement, as shown above. If the function terminates without a `return`, or at a `return` statement with no value, the value of the `out` argument is returned to the caller. If it includes a `return` statement with a value, that value is returned to the caller, as usual.

The fact that a function uses a named result is transparent to the caller. That is, these two signatures are interchangeable to the caller:

```mojo
def get_name_tag(owned name: String) -> NameTag:
    ...
def get_name_tag(owned name: String, out name_tag: NameTag):
    ...
```

In both cases, the call looks like this:

```mojo
tag = get_name_tag("Judith")
```

Because the return value is assigned to this special `out` variable, it doesn't need to be moved or copied when it's returned to the caller. This means that you can create a function that returns a type that can't be moved or copied, and which takes several steps to initialize:

```mojo
struct ImmovableObject:
    var name: String

    fn __init__(out self, owned name: String):
        self.name = name^

def create_immovable_object(owned name: String, out obj: ImmovableObject):
    obj = ImmovableObject(name^)
    obj.name += "!"
    # obj is implicitly returned

def main():
    my_obj = create_immovable_object("Blob")
```

By contrast, the following function with a standard return value doesn't work:

```mojo
def create_immovable_object2(owned name: String) -> ImmovableObject:
    obj = ImmovableObject(name^)
    obj.name += "!"
    return obj^  # Error: ImmovableObject is not copyable or movable
```

Because `create_immovable_object2` uses a local variable to store the object while it's under construction, the return call requires it to be either moved or copied to the caller. This isn't an issue if the newly-created value is returned immediately:

```mojo
def create_immovable_object3(owned name: String) -> ImmovableObject:
    return ImmovableObject(name^)  # OK
```

## Raising and non-raising functions

By default, when a function raises an error, the function terminates immediately and the error propagates to the calling function. If the calling function doesn't handle the error, it continues to propagate up the call stack.

```mojo
def raises_error():
    raise Error("There was an error.")
```

The Mojo compiler *always* treats a function declared with `def` as a *raising function*, even if the body of the function doesn't contain any code that could raise an error.
Functions declared with `fn` without the `raises` keyword are *non-raising functions*—that is, they are not allowed to propagate an error to the calling function. If a non-raising function calls a raising function, it **must handle any possible errors.**

```mojo
# This function will not compile
fn unhandled_error():
    raises_error()  # Error: can't call raising function in a non-raising context

# Explicitly handle the error
fn handle_error():
    try:
        raises_error()
    except e:
        print("Handled an error:", e)

# Explicitly propagate the error
fn propagate_error() raises:
    raises_error()
```

If you're writing code that you expect to use widely or distribute as a package, you may want to use `fn` functions for APIs that don't raise errors to limit the number of places users need to add unnecessary error handling code. For some extremely performance-sensitive code, it may be preferable to avoid run-time error-handling.

For more information, see [Errors, error handling, and context managers](/mojo/manual/errors).

---

## fused_concat

`fused_concat[type: DType, rank: Int, single_thread_blocking_override: Bool, input_fn: fn[Int, Int, Int](IndexList[$2]) capturing -> SIMD[type, $1], output_0_fn: fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](axis: Int, input_shapes: StaticTuple[IndexList[rank], size], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)`

---

## fused_qk_rope

`fused_qk_rope[type: DType, collection_t: KVCollectionT, //, cache_t: KVCacheT, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: Optional[DeviceContext])`

---

## fused_qk_rope

## Functions

* [​`fused_qk_rope`](./fused_qk_rope):
* [​`fused_qk_rope_ragged`](./fused_qk_rope_ragged): Applies RoPE (Rotary Position Embedding) to query and key tensors.
* [​`get_identity_rope_coeff`](./get_identity_rope_coeff):
* [​`get_safetensors_idx`](./get_safetensors_idx):
* [​`rope_k_cache`](./rope_k_cache):
* [​`rope_q_proj`](./rope_q_proj):

---

## fused_qk_rope_ragged

`fused_qk_rope_ragged[type: DType, collection_t: KVCollectionT, //, cache_t: KVCacheT, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: Optional[DeviceContext])`

Applies RoPE (Rotary Position Embedding) to query and key tensors. This function applies RoPE only to the last `rope_dim` elements of each head, leaving the first `unroped_dim` elements unchanged. This is required for DeepSeek models where only part of each head undergoes rotary transformation.

---

## gamma

`gamma[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the Gamma of the input.

**Constraints:**

The input must be a floating-point type.

**Parameters:**

* ​dtype (`DType`): The `dtype` of the input and output SIMD vector.
* ​width (`Int`): The width of the input and output SIMD vector.

**Args:**

* ​x (`SIMD[dtype, width]`): The input argument.

**Returns:**

The Gamma function evaluated at the input.
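As a quick illustration, here is a minimal sketch using `gamma` from the `math` package. It relies on the identity that, for positive integers, the Gamma function satisfies Γ(n) = (n - 1)!:

```mojo
from math import gamma

def main():
    # Gamma(5) == 4! == 24
    print(gamma(Float64(5.0)))  # 24.0
```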
---

## gather

`gather[type: DType, indices_type: DType, //, *, axis: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: NDBuffer[type, rank, origin, shape, strides], input: NDBuffer[type, rank, origin, shape, strides], indices: NDBuffer[indices_type, rank, origin, shape, strides], *, context: DeviceContext)`

Gather operation as defined by the ONNX Gather operator. Note that this is NOT the same as the default PyTorch gather (which is equivalent to the ONNX GatherElements operator).

`gather[type: DType, indices_type: DType, //, *, axis: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](output: NDBuffer[type, rank, origin, shape, strides], input: NDBuffer[type, rank, origin, shape, strides], indices: NDBuffer[indices_type, rank, origin, shape, strides], *, context: DeviceContextPtr = DeviceContextPtr())`

Gather operation as defined by the ONNX Gather operator. Note that this is NOT the same as the default PyTorch gather (which is equivalent to the ONNX GatherElements operator).

`gather[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], *, context: DeviceContext)`

Gather operation as defined by the ONNX Gather operator. Note that this is NOT the same as the default PyTorch gather (which is equivalent to the ONNX GatherElements operator).

`gather[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], *, context: DeviceContextPtr = DeviceContextPtr())`

Gather operation as defined by the ONNX Gather operator. Note that this is NOT the same as the default PyTorch gather (which is equivalent to the ONNX GatherElements operator).

---

## gather

`gather[dtype: DType, size: Int, //, *, invariant: Bool = False](owned base: SIMD[index, size], mask: SIMD[bool, size], passthrough: SIMD[dtype, size], alignment: Int = 0) -> SIMD[dtype, size]`

Reads scalar values from a SIMD vector, and gathers them into one vector.

The gather function reads scalar values from a SIMD vector of memory locations and gathers them into one vector. The memory locations are provided in the vector of pointers `base` as addresses. The memory is accessed according to the provided mask. The mask holds a bit for each vector lane, and is used to prevent memory accesses to the masked-off lanes.
The masked-off lanes in the result vector are taken from the corresponding lanes of the `passthrough` operand.

In general, for some vector of pointers `base`, mask `mask`, and passthrough `passthrough`, a call of the form:

```mojo
result = gather(base, mask, passthrough)
```

is equivalent to the following sequence of scalar loads in C++:

```cpp
for (int i = 0; i < size; i++)
    result[i] = mask[i] ? *base[i] : passthrough[i];
```

**Parameters:**

* ​dtype (`DType`): DType of the return SIMD buffer.
* ​size (`Int`): Size of the return SIMD buffer.
* ​invariant (`Bool`): Whether the memory is load invariant.

**Args:**

* ​base (`SIMD[index, size]`): The vector containing memory addresses that gather will access.
* ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the base vector.
* ​passthrough (`SIMD[dtype, size]`): In the result vector, the masked-off lanes are replaced with the passthrough vector.
* ​alignment (`Int`): The alignment of the source addresses. Must be 0 or a power of two constant integer value.

**Returns:**

A SIMD\[dtype, size] containing the result of the gather operation.

---

## gather_elements

`gather_elements[rank: Int, input_type: DType, indices_type: DType](input: NDBuffer[input_type, rank, origin], indices: NDBuffer[indices_type, rank, origin], _axis: Int, output: NDBuffer[input_type, rank, origin])`

Implements the ONNX GatherElements op, which is equivalent to PyTorch gather.

---

## gather_elementwise_fn_wrapper

`gather_elementwise_fn_wrapper[*, type: DType, indices_type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], indices_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[indices_type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, simd_width: Int, prefetch_fn: OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None] = OptionalReg[fn[Int, Int](IndexList[$0], IndexList[$1]) capturing -> None]({:i1 0, 1})](axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type], coords: IndexList[size, element_type=element_type])`

---

## gather_guards

`gather_guards(axis: Axis, input_shape: IndexList[size, element_type=element_type], indices_shape: IndexList[size, element_type=element_type], output_shape: IndexList[size, element_type=element_type])`

---

## gather_nd

`gather_nd[type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, output_rank: Int, batch_dims: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](data: NDBuffer[type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], output: NDBuffer[type, output_rank, origin], ctx: DeviceContextPtr)`

GatherND operation as defined in the ONNX specification. Based on the ONNX reference implementation.

**Parameters:**

* ​type (`DType`): Type of data tensor.
* ​indices\_type (`DType`): Type of indices tensor.
* ​data\_rank (`Int`): Rank of data tensor (data\_rank >= 1).
* ​indices\_rank (`Int`): Rank of indices tensor (indices\_rank >= 1).
* ​output\_rank (`Int`): Rank of output tensor.
* ​batch\_dims (`Int`): Number of batch dimensions. Gather indexing starts from the dimensions of data\[batch\_dims:].
* ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to execute on.
* ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.
**Args:**

* ​data (`NDBuffer[type, data_rank, origin]`): Tensor of rank data\_rank >= 1.
* ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank >= 1. All index values are expected to be within bounds \[-s, s-1] along axis of size s. It is an error if any of the index values are out of bounds.
* ​output (`NDBuffer[type, output_rank, origin]`): Tensor of rank data\_rank + indices\_rank - indices\_shape\[-1] - 1 - batch\_dims.
* ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler.

---

## gather_nd_shape

`gather_nd_shape[input_rank: Int, indices_rank: Int, output_rank: Int, input_type: DType, indices_type: DType, batch_dims: Int, single_thread_blocking_override: Bool = True](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin]) -> IndexList[output_rank]`

Compute the output shape of a `gather_nd` operation, and assert the inputs are compatible.

**Parameters:**

* ​input\_rank (`Int`): Rank of the input tensor.
* ​indices\_rank (`Int`): Rank of the indices tensor.
* ​output\_rank (`Int`): Rank of the output tensor.
* ​input\_type (`DType`): Type of the input tensor.
* ​indices\_type (`DType`): Type of the indices tensor.
* ​batch\_dims (`Int`): Batch dimensions.
* ​single\_thread\_blocking\_override (`Bool`): If True, then reduction is run synchronously using a single thread.

**Args:**

* ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor.
* ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor.

**Returns:**

The output shape.

---

## gather_reduce

`gather_reduce[type: DType, gather_axis: Int, reduce_axis: Int, simd_width: Int, reduce_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], output_rank: Int, output_shape: DimList, input_rank: Int, input_shape: DimList, indices_rank: Int, indices_shape: DimList](output: NDBuffer[type, output_rank, origin, output_shape], input: NDBuffer[type, input_rank, origin, input_shape], indices: NDBuffer[int32, indices_rank, origin, indices_shape], reduce_init: SIMD[type, 1])`

Computes output\[i, j, k] = input\[indices\[i, j], k] and simultaneously reduces the output across axis 1 to produce output\[i, k].

The motivating use-case for this is multi-hot embeddings in recommender models. This provides similar functionality to Torch's EmbeddingBag layer. In that context, i is the batch dimension, j is the multi-hot dimension, and k is the embedding dimension.

---

## gather_scatter

## Structs

* [​`Axis`](./Axis):

## Functions

* [​`gather`](./gather): Gather operation as defined in the ONNX specification.
* [​`gather_elements`](./gather_elements): Implements the ONNX GatherElements op, which is equivalent to PyTorch gather.
* [​`gather_elementwise_fn_wrapper`](./gather_elementwise_fn_wrapper):
* [​`gather_guards`](./gather_guards):
* [​`gather_nd`](./gather_nd): GatherND operation as defined in the ONNX specification. Based on the ONNX reference implementation.
* [​`gather_nd_shape`](./gather_nd_shape): Compute the output shape of a `gather_nd` operation, and assert the inputs are compatible.
* [​`gather_reduce`](./gather_reduce): Computes output\[i, j, k] = input\[indices\[i, j], k] and simultaneously reduces the output across axis 1 to produce output\[i, k].
* [​`gather_shape`](./gather_shape): Compute the output shape of a `gather` operation, and assert the inputs are compatible.
* [​`normalize_neg_index`](./normalize_neg_index): Indices passed to gather and scatter ops may be negative.
This performs a normalization so that they can be used to index into a buffer.

* [​`scatter_elements`](./scatter_elements): Implements the ONNX ScatterElements op, which is equivalent to PyTorch scatter.
* [​`scatter_elements_shape`](./scatter_elements_shape): Compute the output shape of a `scatter_elements` operation, and assert the inputs are compatible.
* [​`scatter_nd`](./scatter_nd): Scatter\_nd operation without any reduction.
* [​`scatter_nd_generator`](./scatter_nd_generator): Implements the ONNX ScatterND operation as defined in the ONNX specification.
* [​`scatter_nd_shape`](./scatter_nd_shape): Compute the output shape of a `scatter_nd` operation, and assert the inputs are compatible.

---

## gather_shape

`gather_shape[output_rank: Int, input_rank: Int, indices_rank: Int, input_type: DType, indices_type: DType, single_thread_blocking_override: Bool = False](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin], axis: Int) -> IndexList[output_rank]`

Compute the output shape of a `gather` operation, and assert the inputs are compatible.

**Parameters:**

* ​output\_rank (`Int`): Rank of the output tensor.
* ​input\_rank (`Int`): Rank of the input tensor.
* ​indices\_rank (`Int`): Rank of the indices tensor.
* ​input\_type (`DType`): Type of the input tensor.
* ​indices\_type (`DType`): Type of the indices tensor.
* ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.

**Args:**

* ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor.
* ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor.
* ​axis (`Int`): The axis.

**Returns:**

The output shape.

---

## gcd

`gcd(m: Int, n: Int, /) -> Int`

Compute the greatest common divisor of two integers.

**Args:**

* ​m (`Int`): The first integer.
* ​n (`Int`): The second integer.

**Returns:**

The greatest common divisor of the two integers.

`gcd(s: Span[Int, origin], /) -> Int`

Computes the greatest common divisor of a span of integers.

**Args:**

* ​s (`Span[Int, origin]`): A span containing a collection of integers.

**Returns:**

The greatest common divisor of all the integers in the span.

`gcd(l: List[Int, hint_trivial_type], /) -> Int`

Computes the greatest common divisor of a list of integers.

**Args:**

* ​l (`List[Int, hint_trivial_type]`): A list containing a collection of integers.

**Returns:**

The greatest common divisor of all the integers in the list.

`gcd(*values: Int) -> Int`

Computes the greatest common divisor of a variadic number of integers.

**Args:**

* ​\*values (`Int`): A variadic list of integers.

**Returns:**

The greatest common divisor of the given integers.

---

## gelu

`gelu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]`

Compute the GELU Op using the equation $0.5 * x * (1 + erf(x / sqrt(2)))$.

**Constraints:**

Type must be a floating point type.

**Parameters:**

* ​type (`DType`): DType used for the computation.
* ​simd\_width (`Int`): SIMD width used for the computation.

**Args:**

* ​x (`SIMD[type, simd_width]`): The value to compute the GELU operation on.

**Returns:**

The result of the GELU operation.

---

## gelu_approximate

`gelu_approximate[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]`

Compute the approximate GELU Op using the equation $0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))$.

**Constraints:**

Type must be a floating point type.
**Parameters:**

* ​type (`DType`): The `DType` used for the computation.
* ​simd\_width (`Int`): SIMD width used for the computation.

**Args:**

* ​x (`SIMD[type, simd_width]`): The value to compute the GELU operation on.

**Returns:**

The result of the approximate GELU operation.

---

## GemmShape

`@register_passable(trivial)`

`struct GemmShape`

Helper struct to unpack gemm dimensions and layout.

## Fields

* ​M (`Int`):
* ​N (`Int`):
* ​K (`Int`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit`

`__init__(index: IndexList[3]) -> Self`

Constructor of a gemm shape record from an index tuple.

**Args:**

* ​index (`IndexList[3]`): The int tuple containing the index (m, n, k).

### `__getitem__`

`__getitem__(self, idx: Int) -> Int`

### `__setitem__`

`__setitem__(mut self, idx: Int, value: Int)`

### `__add__`

`__add__(self, rhs: Self) -> Self`

Coordinate-wise addition of two gemm shape records.

**Args:**

* ​rhs (`Self`): Another gemm shape record to add with.

### `__sub__`

`__sub__(self, rhs: Self) -> Self`

Coordinate-wise subtraction of two gemm shape records.

**Args:**

* ​rhs (`Self`): Another gemm shape record to subtract.

### `get`

`static get[transpose_b: Bool](c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Self`

Constructor of a gemm shape record from input buffers. M, N, and K are intentionally calculated using `a` and `c` ONLY. This is because `b` may be padded to a multiple of the tile size if it has been pre-packed.

**Args:**

* ​c (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer with allocated output space.
* ​a (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer containing matrix operand A.
* ​b (`NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): NDBuffer containing matrix operand B.

### `as_index`

`as_index(self) -> IndexList[3]`

Utility to convert the underlying data to an index tuple, so that utilities such as elementwise add can be used.

**Returns:**

The constructed index tuple.

---

## gemv

`gemv[parallelize: Bool, c_size: Dim, c_type: DType, a_shape: DimList, a_type: DType, b_size: Dim, b_type: DType, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c_buf: NDBuffer[c_type, 1, origin, __init__[::Intable](c_size)], a_buf: NDBuffer[a_type, 2, origin, a_shape], b_buf: NDBuffer[b_type, 1, origin, __init__[::Intable](b_size)])`

---

## gemv

## Structs

* [​`GEMVAlgorithm`](./GEMVAlgorithm):

## Functions

* [​`gemv`](./gemv):
* [​`gemv_gpu`](./gemv_gpu):
* [​`gemv_gpu_dispatch`](./gemv_gpu_dispatch):
* [​`gemv_kernel`](./gemv_kernel):
* [​`gemv_kernel_vector`](./gemv_kernel_vector):
* [​`gemv_split_k`](./gemv_split_k): GEMV with tiling in the K dimension. Assuming the B (weight) matrix is transposed, i.e., row-major N x K, this kernel implements a vector (1 x K) times a matrix (N x K).
* [​`gevm_kernel`](./gevm_kernel):
* [​`gevm_tc_kernel_vector_8x`](./gevm_tc_kernel_vector_8x):
* [​`naive_gemv`](./naive_gemv):
* [​`reverse_idx`](./reverse_idx):

---

## gemv_gpu

`gemv_gpu[transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ctx: DeviceContext)`

---

## gemv_gpu_dispatch

`gemv_gpu_dispatch[transpose_b: Bool = False, reduction_method: ReductionMethod = ReductionMethod(1), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](kernel_func: GEMVAlgorithm, c: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b: NDBuffer[type, 2, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ctx: DeviceContext)`

---

## gemv_kernel

`gemv_kernel[c_type: DType, a_type: DType, b_type: DType, *, reduction_method: ReductionMethod, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: UnsafePointer[SIMD[c_type, 1]], a: UnsafePointer[SIMD[a_type, 1]], b: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)`

---

## gemv_kernel_vector

`gemv_kernel_vector[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, *, reduction_method: ReductionMethod, simd_width: UInt, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], m: UInt, n: UInt, k: UInt)`

---

## gemv_split_k

`gemv_split_k[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, simd_width: UInt, tile_m: UInt, tile_n: UInt, num_threads: UInt, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](output: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], act: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], weight: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], m: UInt, n: UInt, k: UInt)`

GEMV with tiling in the K dimension. Assuming the B (weight) matrix is transposed, i.e., row-major N x K, this kernel implements a vector (1 x K) times a matrix (N x K).

The implementation can actually handle M > 1, but it's only optimal for tiny M.
We use it for M = 1 only.

---

## GEMVAlgorithm

`struct GEMVAlgorithm`

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `GEMV_KERNEL`

`alias GEMV_KERNEL = GEMVAlgorithm(0)`

### `GEMV_KERNEL_VECTOR`

`alias GEMV_KERNEL_VECTOR = GEMVAlgorithm(1)`

### `GEMV_SPLIT_K`

`alias GEMV_SPLIT_K = GEMVAlgorithm(2)`

### `GEVM_KERNEL`

`alias GEVM_KERNEL = GEMVAlgorithm(4)`

### `GEVM_KERNEL_VECTOR`

`alias GEVM_KERNEL_VECTOR = GEMVAlgorithm(3)`

### `MATMUL_NAIVE`

`alias MATMUL_NAIVE = GEMVAlgorithm(5)`

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

### `__ne__`

`__ne__(self, other: Self) -> Bool`

### `__is__`

`__is__(self, other: Self) -> Bool`

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

---

## Generate image descriptions with Llama 3.2 Vision

import SmallCards from '@site/src/components/SmallCards';
import InstallModular from '@site/docs/_includes/install-modular.mdx';
import Requirements from '@site/src/components/Requirements';
import { requirementsWithGPU } from '@site/docs/max/requirements';

The MAX framework simplifies the process of creating an endpoint for multimodal models that handle both text and images, such as [Llama 3.2 11B Vision Instruct](https://builds.modular.com/models/Llama-3.2-Vision-Instruct/11B), which excels at tasks such as image captioning and visual question answering. This tutorial walks you through installing the necessary tools, configuring access, and serving the model locally with an OpenAI-compatible endpoint.

:::note GPU required

To run the model in this tutorial, your system must have a [compatible GPU](/max/faq#gpu-requirements).

:::

System requirements:

## Set up your environment

Create a Python project to install our APIs and CLI tools:

## Configure Hugging Face access

To get the model used below, you must have a Hugging Face user access token and approved access to the [Llama 3.2 11B Vision Instruct Hugging Face repo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).

To create a Hugging Face user access token, see [Access Tokens](https://huggingface.co/settings/tokens).

Within your local environment, save your access token as an environment variable.

```bash
export HF_TOKEN="hf_..."
```

## Generate a sample description

You can generate an image description using the [`max generate`](/max/max-cli#generate) command. Downloading the Llama 3.2 11B Vision Instruct model weights takes some time.

:::note

You may need to alter the `--max-length` and `--max-batch-size` parameters depending on the amount of memory you have access to. The following command is optimized for a `p4d.24xlarge` instance with one NVIDIA A100 GPU and 96 vCPUs.

:::

```bash
max generate \
    --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
    --prompt "What is in this image?" \
    --image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" \
    --max-new-tokens 100 \
    --max-batch-size 1 \
    --max-length 108172
```

When using the `max` CLI tool with multimodal input, you must provide both a `--prompt` and an `--image_url`. Additionally, the prompt should be in a valid format for the model used. For Llama 3.2 Vision 11B Instruct, you must include the `<|image|>` tag in the prompt if the input includes an image to reason about. For more information about Llama 3.2 Vision prompt templates, see [Vision Model Inputs and Outputs](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/#-vision-model-inputs-and-outputs-).
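For example, to follow that template, the prompt from the command above would carry the tag as a prefix. A sketch (assuming `<|image|>` is the image placeholder token from Meta's Llama 3.2 prompt format; all other flags are unchanged):

```bash
max generate \
    --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
    --prompt "<|image|>What is in this image?" \
    --image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" \
    --max-new-tokens 100 \
    --max-batch-size 1 \
    --max-length 108172
```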
## Serve the Llama 3.2 Vision model You can alternatively serve the Llama 3.2 Vision model and make multiple requests to a local endpoint. If you already tested the model with the `max generate` command, you do not have to wait for the model to download again. Serve the model with the [`max serve`](/max/max-cli/#serve) command: ```bash max serve \ --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \ --max-length 108172 \ --max-batch-size 1 ``` The endpoint is ready when you see this message printed in your terminal: ```bash Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit) ``` ## Test the endpoint After the server is running, you can test it by opening a new terminal window and sending a `curl` request. :::note When making requests with `max serve`, you do not need to include model-specific image tags within your prompt. ::: The following request includes an image URL and a question to answer about the provided image: ```bash curl -N http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-3.2-11B-Vision-Instruct", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "What is in this image?" }, { "type": "image_url", "image_url": { "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg" } } ] } ], "max_tokens": 300 }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g' ``` This sends an image along with a text prompt to the model, and you should receive a response describing the image. You can test the endpoint with any local base64-encoded image or any image URL. :::note If you make significant changes to the provided request template, you might receive less accurate responses. Some parts of the text prompt get ignored for certain input combinations. We've identified the problem and will have a fix in a subsequent [nightly release](/max/packages/#nightly-release). 
::: ## Next steps Now that you have successfully deployed Llama 3.2 Vision, you can: - Experiment with different images and prompts - Explore deployment configurations and additional features, such as [function calling](/max/serve/function-calling), [prefix caching](/max/serve/prefix-caching), and [structured output](/max/serve/structured-output) - Deploy the model to a containerized cloud environment for scalable serving export const cards = [ { title: 'Deploy Llama 3 on GPU with MAX Serve', link: '/max/tutorials/max-serve-local-to-cloud', description: `Learn how to deploy Llama 3 on GPU with MAX Serve.`, }, { title: 'Deploy Llama 3 on GPU-powered Kubernetes clusters', link: '/max/tutorials/deploy-max-serve-on-kubernetes', description: `Learn how to deploy Llama 3 using Kubernetes, MAX, and NVIDIA GPUs`, }, ]; --- ## generic_cross_attention_kv_cache `generic_cross_attention_kv_cache[collection_t: KVCollectionT, type: DType, //, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], q_input_row_offsets: ManagedTensorSlice[io_spec, static_spec=static_spec], q_max_seq_len: NDBuffer[uint32, 1, origin, shape, strides], kv_input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_decode_kv_cache_ragged `generic_flare_mla_decode_kv_cache_ragged[collection_t: KVCollectionT, type: DType, //, mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], target: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_decompress_k_cache_ragged_paged `generic_flare_mla_decompress_k_cache_ragged_paged[target: StringSlice[StaticConstantOrigin], type: DType](buffer_row_offsets_1d: NDBuffer[uint32, 1, origin, shape, strides], cache_offsets_1d: NDBuffer[uint32, 1, origin, shape, strides], buffer_length: SIMD[int32, 1], weight: NDBuffer[type, 2, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode], layer_idx: SIMD[uint32, 1], k_latent_buffer: NDBuffer[type, 2, origin, shape, strides], k_buffer: NDBuffer[type, 2, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flare_mla_prefill_kv_cache_ragged `generic_flare_mla_prefill_kv_cache_ragged[collection_t: KVCollectionT, type: DType, //, softmax_type: DType, write_softmax_info: Bool, use_cascade_attention: Bool, mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], target: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], k: NDBuffer[type, 3, origin, shape, strides], v: NDBuffer[type, 3, origin, shape, strides], buffer_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], cache_offsets: NDBuffer[uint32, 1, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, 
shape, strides], softmax_info: NDBuffer[softmax_type, 3, MutableAnyOrigin], context: DeviceContextPtr, prev_output: OptionalReg[NDBuffer[type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[type, 3, MutableAnyOrigin]]({:i1 0, 1}), prev_softmax_info: OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]] = OptionalReg[NDBuffer[softmax_type, 3, MutableAnyOrigin]]({:i1 0, 1}))` --- ## generic_flare_mla_prefill_ragged_paged_plan `generic_flare_mla_prefill_ragged_paged_plan[target: StringSlice[StaticConstantOrigin]](input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode], layer_idx: SIMD[uint32, 1], buffer_token_size: SIMD[uint32, 1], buffer_row_offsets: NDBuffer[uint32, 2, origin, shape, strides], cache_offsets: NDBuffer[uint32, 2, origin, shape, strides], buffer_lengths: NDBuffer[int32, 1, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flash_attention_kv_cache_padded `generic_flash_attention_kv_cache_padded[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1, num_heads: Int = -1](q: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], valid_lengths: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flash_attention_kv_cache_padded_materialized_mask `generic_flash_attention_kv_cache_padded_materialized_mask[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1, num_heads: Int = -1](q: NDBuffer[type, 4, origin, shape, strides], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], mask: NDBuffer[type, rank, origin, shape, strides], valid_lengths: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_flash_attention_kv_cache_ragged `generic_flash_attention_kv_cache_ragged[collection_t: KVCollectionT, type: DType, //, *, target: StringSlice[StaticConstantOrigin], mask_str: StringSlice[StaticConstantOrigin], score_mod_str: StringSlice[StaticConstantOrigin], local_window_size: Int = -1](q: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: ManagedTensorSlice[io_spec, static_spec=static_spec], kv_collection: collection_t, layer_idx: SIMD[uint32, 1], scale: SIMD[float32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_fused_qk_rope_bshd_continuous_batch `generic_fused_qk_rope_bshd_continuous_batch[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 4, origin, shape, strides], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_, assert_write_mode], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 4, origin, shape, strides], context: DeviceContextPtr = DeviceContextPtr())` Performs a fused RoPE projection for Q and K projections. We have a manually fused QKV projection with mo.opaque types in our Llama model. Due to a limitation in custom op definitions, we can't declare both a tensor and opaque type as output from a custom kernel. 
This requires us to only note Q\_proj as an output from the QKV projection. If we immediately follow the QKV proj kernel with a RoPE kernel applied to K, we'll get a race condition because the graph compiler doesn't know about the dependency between these kernels in the graph definition. Here we fuse the RoPE kernel applied to Q\_proj with K\_proj, so K\_proj RoPE is only executed after QKV completes. --- ## generic_fused_qk_rope_bshd_continuous_batch_ragged `generic_fused_qk_rope_bshd_continuous_batch_ragged[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_, assert_write_mode], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr)` --- ## generic_fused_qk_rope_bshd_paged_ragged `generic_fused_qk_rope_bshd_paged_ragged[type: DType, //, *, interleaved: Bool, target: StringSlice[StaticConstantOrigin]](q_proj: NDBuffer[type, 3, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode], freqs_cis: NDBuffer[type, 2, origin, shape, strides], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape, strides], context: DeviceContextPtr = DeviceContextPtr())` Performs a fused RoPE projection for Q and K projections. We have a manually fused QKV projection with mo.opaque types in our Llama model. Due to a limitation in custom op definitions, we can't declare both a tensor and opaque type as output from a custom kernel. This requires us to only note Q\_proj as an output from the QKV projection. If we immediately follow the QKV proj kernel with a RoPE kernel applied to K, we'll get a race condition because the graph compiler doesn't know about the dependency between these kernels in the graph definition. Here we fuse the RoPE kernel applied to Q\_proj with K\_proj, so K\_proj RoPE is only executed after QKV completes. --- ## generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch `generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch[type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 3, origin, shape], weight: NDBuffer[type, 2, origin, shape], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_, assert_write_mode], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 3, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 3, origin, shape]`): Tensor with shape (batch\_size, seq\_len, num\_heads \* head\_size). * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`ContinuousBatchingKVCacheCollection[type_, kv_params_, assert_write_mode]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​output (`NDBuffer[type, 3, origin, shape]`): The pre-allocated output buffer for Q projections. 
K and V projections are written in-place to k\_cache and v\_cache. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_cont_batch_ragged `generic_fused_qkv_matmul_kv_cache_cont_batch_ragged[type: DType, //, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape], kv_collection: ContinuousBatchingKVCacheCollection[type_, kv_params_, assert_write_mode], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`ContinuousBatchingKVCacheCollection[type_, kv_params_, assert_write_mode]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged `generic_fused_qkv_matmul_kv_cache_paged_ragged[type: DType, weight_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), group_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), has_zp: OptionalReg[Bool] = OptionalReg[Bool]({:i1 0, 1})](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. 
K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged_bias `generic_fused_qkv_matmul_kv_cache_paged_ragged_bias[type: DType, weight_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), group_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), has_zp: OptionalReg[Bool] = OptionalReg[Bool]({:i1 0, 1})](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode], layer_idx: SIMD[uint32, 1], output: NDBuffer[type, 2, origin, shape], bias: NDBuffer[type, 1, origin], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode]`): The object storing the KVCache for this layer. * ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection. * ​output (`NDBuffer[type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size). * ​bias (`NDBuffer[type, 1, origin]`): Bias to be added to the QKV Tensor. Tensor is concatenated q + k + v. Rank 1. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## generic_fused_qkv_matmul_kv_cache_paged_ragged_scale `generic_fused_qkv_matmul_kv_cache_paged_ragged_scale[type: DType, weight_type: DType, output_type: DType, scale_type: DType, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[weight_type, 2, origin, shape], input_scale: NDBuffer[scale_type, 2, origin, shape], weight_scale: NDBuffer[scale_type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode], layer_idx: SIMD[uint32, 1], output: NDBuffer[output_type, 2, origin, shape], ctx: DeviceContextPtr)` Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,). The value at each index is the start\_idx of the corresponding batch in hidden\_state. * ​weight (`NDBuffer[weight_type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). 
* ​input\_scale (`NDBuffer[scale_type, 2, origin, shape]`): Scale to be multiplied with the input Tensor.
* ​weight\_scale (`NDBuffer[scale_type, 2, origin, shape]`): Scale to be multiplied with the weight Tensor.
* ​kv\_collection (`PagedKVCacheCollection[type_, kv_params_, page_size, assert_write_mode]`): The object storing the KVCache for this layer.
* ​layer\_idx (`SIMD[uint32, 1]`): The current layer, used to retrieve the KVCache object from kv\_collection.
* ​output (`NDBuffer[output_type, 2, origin, shape]`): The pre-allocated output buffer for Q projections. K and V projections are written in-place to k\_cache and v\_cache. Shape: (sum(seq\_lens), num\_heads \* head\_size).
* ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler.

---

## generic_get_continuous_cache

`generic_get_continuous_cache[type: DType, kv_params: KVCacheStaticParams](blocks: NDBuffer[type, 6, origin], cache_lengths: NDBuffer[uint32, 1, origin], lookup_table: NDBuffer[uint32, 1, origin], max_lengths: NDBuffer[uint32, 2, origin]) -> ContinuousBatchingKVCacheCollection[type, kv_params]`

---

## generic_get_paged_cache

`generic_get_paged_cache[type: DType, kv_params: KVCacheStaticParams, page_size: Int](blocks: NDBuffer[type, 6, origin], cache_lengths: NDBuffer[uint32, 1, origin], lookup_table: NDBuffer[uint32, 2, origin], max_lengths: NDBuffer[uint32, 2, origin], out result: PagedKVCacheCollection[type, kv_params, page_size])`

---

## genlut

`genlut(gpr: Int)`

---

## Get started with GPU programming

import GetMagic from '@site/src/includes/get_magic.mdx';
import Requirements from '@site/src/components/Requirements';
import { requirementsWithGPU } from '@site/docs/max/requirements';

This tutorial introduces you to GPU programming with Mojo. You'll learn how to write a simple program that performs vector addition on a GPU, exploring fundamental concepts of GPU programming along the way.

By the end of this tutorial, you will:

- Understand basic GPU programming concepts like grids and thread blocks.
- Learn how to move data between CPU and GPU memory.
- Write and compile a simple GPU kernel function.
- Execute parallel computations on the GPU.
- Understand the asynchronous nature of GPU programming.

We'll build everything step-by-step, starting with the basics and gradually adding more complexity. The concepts you learn here will serve as a foundation for more advanced GPU programming with Mojo. If you just want to see the finished code, you can [get it on GitHub](https://github.com/modular/modular/tree/main/examples/mojo/gpu-intro).

System requirements:

## 1. Create a Mojo project with `magic`

We'll start by using the [`magic`](/magic) CLI to create a virtual environment and generate our initial project directory.

1. 

2. Navigate to the directory in which you want to create the project and execute:

   ```bash
   magic init gpu-intro --format mojoproject
   ```

   This creates a project directory named `gpu-intro`.

3. Let's go into the directory and verify the project is configured correctly by checking the version of Mojo that's installed within our project's virtual environment:

   ```bash
   cd gpu-intro
   ```

   ```bash
   magic run mojo --version
   ```

   You should see a version string indicating the version of Mojo installed, which by default should be the latest nightly version. Because we used the `--format mojoproject` option when creating the project, `magic` automatically added the `max` package as a dependency, which includes Mojo and the MAX libraries.

4. 
Activate the project's virtual environment: ```bash magic shell ``` Later on, when you want to exit the virtual environment, just type `exit`. ## 2. Get a reference to the GPU device The [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext/) type represents a logical instance of a GPU device. It provides methods for allocating memory on the device, copying data between the host CPU and the GPU, and compiling and running functions (also known as *kernels*) on the device. Use the [`DeviceContext()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#__init__) constructor to get a reference to the GPU device. The constructor raises an error if no compatible GPU is available. You can use the [`has_accelerator()`](/mojo/stdlib/sys/info/has_accelerator/) function to check if a compatible GPU is available. So let's start by writing a program that checks if a GPU is available and then obtains a reference to the GPU device. Using any editor, create a file named `vector_addition.mojo` with the following code: ```mojo title="vector_addition.mojo" from gpu.host import DeviceContext from sys import has_accelerator def main(): @parameter if not has_accelerator(): print("No compatible GPU found") else: ctx = DeviceContext() print("Found GPU:", ctx.name()) ``` Save the file and run it using the `mojo` CLI: ```bash mojo vector_addition.mojo ``` You should see output like the following (depending on the type of GPU you have): ```output Found GPU: NVIDIA A10G ``` :::note Mojo requires a [compatible GPU development environment](/max/faq/#gpu-requirements) to compile kernel functions, otherwise it raises a compile-time error. In our code, we're using the [`@parameter`](/mojo/manual/decorators/parameter) decorator to evaluate the `has_accelerator()` function at compile time and compile only the corresponding branch of the `if` statement. As a result, if you don't have a compatible GPU development environment, you'll see the following message when you run the program: ```output No compatible GPU found ``` In that case, you need to find a system that has a supported GPU to continue with this tutorial. ::: ## 3. Define a simple kernel A GPU *kernel* is simply a function that runs on a GPU, executing a specific computation on a large dataset in parallel across thousands or millions of *threads*. You might already be familiar with threads when programming for a CPU, but GPU threads are different. On a CPU, threads are managed by the operating system and can perform completely independent tasks, such as managing a user interface, fetching data from a database, and so on. But on a GPU, threads are managed by the GPU itself. All the threads on a GPU execute the same kernel function, but they each work on a different part of the data. When you run a kernel, you need to specify the number of threads you want to use. The number of threads you specify depends on the size of the data you want to process and the amount of parallelism you want to achieve. A common strategy is to use one thread per element of data in the result. So if you're performing an element-wise addition of two 1,024-element vectors, you'd use 1,024 threads. A *grid* is the top-level organizational structure for the threads executing a kernel function. A grid consists of multiple *thread blocks*, which are further divided into individual threads that execute the kernel function concurrently. The GPU assigns a unique block index to each thread block, and a unique thread index to each thread within a block. 
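These indices are what let each thread find its own slice of the data. For example, a kernel processing a one-dimensional vector can combine them into a unique global element index. This is just a sketch of the pattern (it assumes `block_dim`, the number of threads per block, from the `gpu.id` module; it's not part of the tutorial program yet):

```mojo
from gpu.id import block_dim, block_idx, thread_idx


fn global_index_demo():
    # A thread's unique position in a flat 1-D layout is its block's
    # offset (block_idx.x * block_dim.x) plus its position within the
    # block (thread_idx.x).
    var i = block_idx.x * block_dim.x + thread_idx.x
    print("Global thread index:", i)
```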
Threads within the same thread block can share data through shared memory and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks. For this tutorial, we won't get into the details of why or how to do this, but it's an important concept to keep in mind when you're writing more complex kernels.

To better understand how grids, thread blocks, and threads are organized, let's write a simple kernel function that prints the thread block and thread indices. Add the following code to your `vector_addition.mojo` file:

```mojo title="vector_addition.mojo"
from gpu.id import block_idx, thread_idx


fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )
```

:::note

We're using `fn` here without the `raises` keyword because a kernel function is not allowed to raise an error condition. In contrast, when you define a Mojo function with `def`, the compiler always assumes that the function *can* raise an error condition. See the [Functions](/mojo/manual/functions) section of the Mojo Manual for more information on the difference between using `fn` and `def` to define functions in Mojo.

:::

## 4. Compile and run the kernel

Next, we need to update the `main()` function to compile the kernel function for our GPU and then run it, specifying the number of thread blocks in the grid and the number of threads per thread block. For this initial example, let's define a grid consisting of 2 thread blocks, each with 64 threads.

Modify the `main()` function so that your program looks like this:

```mojo title="vector_addition.mojo"
from gpu.host import DeviceContext
from gpu.id import block_idx, thread_idx
from sys import has_accelerator


fn print_threads():
    """Print thread IDs."""
    print("Block index: [",
        block_idx.x,
        "]\tThread index: [",
        thread_idx.x,
        "]"
    )


def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        ctx = DeviceContext()
        ctx.enqueue_function[print_threads](grid_dim=2, block_dim=64)
        ctx.synchronize()
        print("Program finished")
```

Save the file and run it:

```bash
mojo vector_addition.mojo
```

You should see something like the following output (which is abbreviated here):

```output
Block index: [ 1 ]	Thread index: [ 32 ]
Block index: [ 1 ]	Thread index: [ 33 ]
Block index: [ 1 ]	Thread index: [ 34 ]
...
Block index: [ 0 ]	Thread index: [ 30 ]
Block index: [ 0 ]	Thread index: [ 31 ]
Program finished
```

Typical CPU-GPU interaction is asynchronous, allowing the GPU to process tasks while the CPU is busy with other work. Each `DeviceContext` has an associated stream of queued operations to execute on the GPU. Operations within a stream execute in the order they are issued.

The [`enqueue_function()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_function) method compiles a kernel function and enqueues it to run on the given device. You must provide the name of the kernel function as a compile-time Mojo parameter, and the following arguments:

- Any additional arguments specified by the kernel function definition (none, in this case; see the sketch below).
- The grid dimensions using the `grid_dim` keyword argument.
- The thread block dimensions using the `block_dim` keyword argument.

(See the [Functions](/mojo/manual/functions) section of the Mojo Manual for more information on Mojo function arguments and the [Parameters](/mojo/manual/parameters) section for more information on Mojo compile-time parameters and metaprogramming.)
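To illustrate the first item in that list, here's a minimal sketch (a hypothetical kernel, not part of the tutorial program): run-time arguments are passed positionally, before the `grid_dim` and `block_dim` keyword arguments.

```mojo
from gpu.host import DeviceContext
from gpu.id import thread_idx


fn print_value(value: Int):
    print("Thread", thread_idx.x, "received", value)


def main():
    ctx = DeviceContext()
    # The kernel's run-time argument (42) precedes the keyword arguments.
    ctx.enqueue_function[print_value](42, grid_dim=1, block_dim=4)
    ctx.synchronize()
```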
:::note Mojo currently doesn't typecheck the arguments to the compiled kernel function. This means that you can encounter obscure errors if the ordering, types, or argument count doesn't match. We're working to add more robust typechecking soon. ::: We're invoking the compiled kernel function with `grid_dim=2` and `block_dim=64`, which means we're using a grid of 2 thread blocks, with 64 threads each for a total of 128 threads in the grid. When you run a kernel, the GPU assigns each thread block within the grid to a *streaming multiprocessor* for execution. A streaming multiprocessor (SM) is the fundamental processing unit of a GPU, designed to execute multiple parallel workloads efficiently. Each SM contains several cores, which perform the actual computations of the threads executing on the SM, along with shared resources like registers, shared memory, and control mechanisms to coordinate the execution of threads. The number of SMs and the number of cores on a GPU depends on its architecture. For example, the NVIDIA H100 PCIe contains 114 SMs, with 128 32-bit floating point cores per SM. Additionally, when an SM is assigned a thread block, it divides the block into multiple *warps*, which are groups of 32 or 64 threads, depending on the GPU architecture. These threads execute the same instruction simultaneously in a *single instruction, multiple threads* (SIMT) model. The SM's *warp scheduler* coordinates the execution of warps on an SM's cores. Warps are used to efficiently utilize GPU hardware by maximizing throughput and minimizing control overhead. Since GPUs are designed for high-performance parallel processing, grouping threads into warps allows for streamlined instruction scheduling and execution, reducing the complexity of managing individual threads. Multiple warps from multiple thread blocks can be active within an SM at any given time, enabling the GPU to keep execution units busy. For example, if the threads of a particular warp are blocked waiting for data from memory, the warp scheduler can immediately switch execution to another warp that's ready to run. After enqueuing the kernel function, we want to ensure that the CPU waits for it to finish execution before exiting the program. We do this by calling the [`synchronize()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#synchronize) method of the `DeviceContext` object, which blocks until the device completes all operations in its queue. ## 5. Manage grid dimensions The grid in the previous step consisted of a one-dimensional grid of 2 thread blocks with 64 threads in each block. However, you can also organize the thread blocks in a two- or even a three-dimensional grid. Similarly, you can arrange the threads in a thread block across one, two, or three dimensions. Typically, you determine the dimensions of the grid and thread blocks based on the dimensionality of the data to process. For example, you might choose a 1-dimensional grid for processing large vectors, a 2-dimensional grid for processing matrices, and a 3-dimensional grid for processing the frames of a video. To better understand how grids, thread blocks, and threads work together, let's modify our `print_threads()` kernel function to print the `x`, `y`, and `z` components of the thread block and thread indices for each thread. 
```mojo title="vector_addition.mojo" fn print_threads(): """Print thread IDs.""" print("Block index: [", block_idx.x, block_idx.y, block_idx.z, "]\tThread index: [", thread_idx.x, thread_idx.y, thread_idx.z, "]" ) ``` Then, update `main()` to enqueue the kernel function with a 2x2x1 grid of thread blocks and a 16x4x2 arrangement of threads within each thread block: ```mojo title="vector_addition.mojo" ctx.enqueue_function[print_threads]( grid_dim=(2, 2, 1), block_dim=(16, 4, 2) ) ``` Save the file and run it again: ```bash mojo vector_addition.mojo ``` You should see something like the following output (which is abbreviated here): ```output Block index: [ 1 1 0 ] Thread index: [ 0 2 0 ] Block index: [ 1 1 0 ] Thread index: [ 1 2 0 ] Block index: [ 1 1 0 ] Thread index: [ 2 2 0 ] ... Block index: [ 0 0 0 ] Thread index: [ 14 1 0 ] Block index: [ 0 0 0 ] Thread index: [ 15 1 0 ] Program finished ``` Try changing the grid and thread block dimensions to see how the output changes. :::note The maximum number of threads per thread block and threads per SM is GPU-specific. For example, the NVIDIA A100 GPU has a maximum of 1,024 threads per thread block and 2,048 threads per SM. Choosing the size and shape of the grid and thread blocks is a balancing act between maximizing the number of threads that can execute concurrently and minimizing the amount of time spent waiting for data to be loaded from memory. Factors such as the size of the data to process, the number of SMs on the GPU, and the memory bandwidth of the GPU can all play a role in determining the optimal grid and thread block dimensions. One general guideline is to choose a thread block size that is a multiple of the warp size. This helps to maximize the utilization of the GPU's resources and minimizes the overhead of managing multiple warps. ::: Now that you understand how to manage grid dimensions, let's get ready to create a kernel that performs a simple element-wise addition of two vectors of floating point numbers. ## 6. Allocate host memory for the input vectors Before creating the two input vectors for our kernel function, we need to understand the distinction between *host memory* and *device memory*. Host memory is dynamic random-access memory (DRAM) accessible by the CPU, whereas device memory is DRAM accessible by the GPU. If you have data in host memory, you must explicitly copy it to device memory before you can use it in a kernel function. Similarly, if your kernel function produces data that you want the CPU to use later, you must explicitly copy it back to host memory. For this tutorial, we'll use the [`HostBuffer`](/mojo/stdlib/gpu/host/device_context/HostBuffer) type to represent our vectors on the host. A `HostBuffer` is a block of host memory associated with a particular `DeviceContext`. It supports methods for transferring data between host and device memory, as well as a basic set of methods for accessing data elements by index and for printing the buffer. Let's update `main()` to create two `HostBuffer`s for our input vectors and initialize them with values. You won't need the `print_threads()` kernel function anymore, so you can remove it and the code to compile and invoke it. 
So after all that, your `vector_addition.mojo` file should look like this: ```mojo title="vector_addition.mojo" from gpu.host import DeviceContext from gpu.id import block_idx, thread_idx from sys import has_accelerator # Vector data type and size alias float_dtype = DType.float32 alias vector_size = 1000 def main(): @parameter if not has_accelerator(): print("No compatible GPU found") else: # Get the context for the attached GPU ctx = DeviceContext() # Create HostBuffers for input vectors lhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype]( vector_size ) rhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype]( vector_size ) ctx.synchronize() # Initialize the input vectors for i in range(vector_size): lhs_host_buffer[i] = Float32(i) rhs_host_buffer[i] = Float32(i * 0.5) print("LHS buffer: ", lhs_host_buffer) print("RHS buffer: ", rhs_host_buffer) ``` The [`enqueue_create_host_buffer()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_create_host_buffer) method accepts the data type as a compile-time parameter and the size of the buffer as a run-time argument and returns a `HostBuffer`. As with all `DeviceContext` methods whose name starts with `enqueue_`, the method is asynchronous and returns immediately, adding the operation to the queue to be executed by the `DeviceContext`. Therefore, we need to call the `synchronize()` method to ensure that the operation has completed before we use the `HostBuffer` object. Then we can initialize the input vectors with values and print them. Now let's run the program to verify that everything is working so far. ```bash mojo vector_addition.mojo ``` You should see the following output: ```output LHS buffer: HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0]) RHS buffer: HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5]) ``` :::note You might notice that we don't explicitly call any methods to free the host memory allocated by our `HostBuffer`s. That's because a `HostBuffer` is subject to Mojo's standard ownership and lifecycle mechanisms. The Mojo compiler analyzes our program to determine the last point that the owner of or a reference to an object is used and automatically adds a call to the object's destructor. In our program, we last reference the buffers at the end of our program's `main()` method. However in a more complex program, the `HostBuffer` could persist across calls to multiple kernel functions if it is referenced at later points in the program. See the [Ownership](/mojo/manual/values/ownership) and [Intro to value lifecycle](/mojo/manual/lifecycle) sections of the Mojo Manual for more information on Mojo value ownership and value lifecycle management. ::: ## 7. Copy the input vectors to GPU memory and allocate an output vector Now that we have our input vectors allocated and initialized on the CPU, let's copy them to the GPU so that they'll be available for the kernel function to use. While we're at it, we'll also allocate memory on the GPU for the output vector that will hold the result of the kernel function. 
Add the following code to the end of the `main()` function: ```mojo title="vector_addition.mojo" # Create DeviceBuffers for the input vectors lhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size) rhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size) # Copy the input vectors from the HostBuffers to the DeviceBuffers ctx.enqueue_copy(dst_buf=lhs_device_buffer, src_buf=lhs_host_buffer) ctx.enqueue_copy(dst_buf=rhs_device_buffer, src_buf=rhs_host_buffer) # Create a DeviceBuffer for the result vector result_device_buffer = ctx.enqueue_create_buffer[float_dtype]( vector_size ) ``` The [`DeviceBuffer`](/mojo/stdlib/gpu/host/device_context/DeviceBuffer) type is analogous to the `HostBuffer` type, but represents a block of device memory associated with a particular `DeviceContext`. Specifically, the buffer is located in the device's *global memory* space, which is accessible by all threads executing on the device. As with a `HostBuffer`, a `DeviceBuffer` is subject to Mojo's standard ownership and lifecycle mechanisms. It persists until it is no longer referenced in the program or until the `DeviceContext` itself is destroyed. The [`enqueue_create_buffer()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_create_buffer) method accepts the data type as a compile-time parameter and the size of the buffer as a run-time argument and returns a `DeviceBuffer`. The operation is asynchronous, but we don't need to call the `synchronize()` method yet because we have more operations to add to the queue. The [`enqueue_copy()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_copy) method is overloaded to support copying from host to device, device to host, or even device to device for systems that have multiple GPUs. In this example, we use it to copy the data in our `HostBuffer`s to the `DeviceBuffer`s. :::note Both `DeviceBuffer` and `HostBuffer` also include [`enqueue_copy_to()`](/mojo/stdlib/gpu/host/device_context/DeviceBuffer#enqueue_copy_to) and [`enqueue_copy_from()`](/mojo/stdlib/gpu/host/device_context/DeviceBuffer#enqueue_copy_from) methods. These are simply convenience methods that call the `enqueue_copy()` method on their corresponding `DeviceContext`. Therefore, we could have written the copy operations in the previous example with the following equivalent code: ```mojo lhs_host_buffer.enqueue_copy_to(dst=lhs_device_buffer) rhs_host_buffer.enqueue_copy_to(dst=rhs_device_buffer) ``` ::: ## 8. Create `LayoutTensor` views One last step before writing the kernel function is that we're going to create a [`LayoutTensor`](/mojo/kernels/layout/layout_tensor/LayoutTensor) view for each of the vectors. `LayoutTensor` provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns. We don't need all of these features for this tutorial, but in more complex kernels it's a useful tool for manipulating data. So even though it isn't strictly necessary for this example, we'll use `LayoutTensor` because you'll see it in more complex examples and it's good to get familiar with it. 
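To make the idea of a layout concrete, here's a minimal sketch, in plain Mojo rather than the `layout` package, of the index arithmetic that a row-major layout performs when mapping a logical coordinate to a linear memory index (the `row_major_index()` function is ours, for illustration only):

```mojo
fn row_major_index(row: Int, col: Int, num_cols: Int) -> Int:
    # In a row-major layout, the elements of a row are contiguous in
    # memory, so moving down one row advances the linear index by num_cols.
    return row * num_cols + col


def main():
    # Element (1, 2) of a grid with 4 columns maps to linear index 6.
    print(row_major_index(1, 2, 4))
```

For the one-dimensional vectors in this tutorial, a row-major layout over `vector_size` elements reduces to the identity mapping, which is why the kernel can index the tensors directly with a single element index.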
First add the following import to the top of the file:

```mojo title="vector_addition.mojo"
from layout import Layout, LayoutTensor
```

A [`Layout`](/mojo/kernels/layout/layout/Layout) is a representation of memory layouts using shape and stride information, and it maps between logical coordinates and linear memory indices. We'll need to use the same `Layout` definition multiple times, so add the following alias to the top of the file after the other aliases:

```mojo title="vector_addition.mojo"
alias layout = Layout.row_major(vector_size)
```

And finally add the following code to the end of the `main()` function to create `LayoutTensor` views for each of the vectors:

```mojo title="vector_addition.mojo"
# Wrap the DeviceBuffers in LayoutTensors
lhs_tensor = LayoutTensor[float_dtype, layout](lhs_device_buffer)
rhs_tensor = LayoutTensor[float_dtype, layout](rhs_device_buffer)
result_tensor = LayoutTensor[float_dtype, layout](result_device_buffer)
```

## 9. Define the vector addition kernel function

Now we're ready to write the kernel function. First add the following imports (note that we've added `block_dim` to the list of imports from `gpu.id`):

```mojo title="vector_addition.mojo"
from gpu.id import block_dim, block_idx, thread_idx
from math import ceildiv
```

Then, add the following code to `vector_addition.mojo` just before the `main()` function:

```mojo title="vector_addition.mojo"
# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
alias block_size = 256
alias num_blocks = ceildiv(vector_size, block_size)


fn vector_addition(
    lhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    rhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    out_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
):
    """Calculate the element-wise sum of two vectors on the GPU."""
    # Calculate the index of the vector element for the thread to process
    var tid = block_idx.x * block_dim.x + thread_idx.x

    # Don't process out of bounds elements
    if tid < vector_size:
        out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]
```

Each thread calculates `tid`, the index of its assigned vector element, from its block and thread indices. Because the grid contains more threads than there are vector elements (4 blocks of 256 threads is 1,024 threads for 1,000 elements), the bounds check ensures that threads whose `tid` falls outside the vector don't touch memory that doesn't belong to them.

## 10. Invoke the kernel function

Add the following code to the end of the `main()` function:

```mojo title="vector_addition.mojo"
# Compile and enqueue the kernel
ctx.enqueue_function[vector_addition](
    lhs_tensor,
    rhs_tensor,
    result_tensor,
    grid_dim=num_blocks,
    block_dim=block_size,
)

# Create a HostBuffer for the result vector
result_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
    vector_size
)

# Copy the result vector from the DeviceBuffer to the HostBuffer
ctx.enqueue_copy(dst_buf=result_host_buffer, src_buf=result_device_buffer)

# Finally, synchronize the DeviceContext to run all enqueued operations
ctx.synchronize()

print("Result vector:", result_host_buffer)
```

Click here to see the complete version of `vector_addition.mojo`.

```mojo title="vector_addition.mojo"
from gpu.host import DeviceContext
from gpu.id import block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from math import ceildiv
from sys import has_accelerator

# Vector data type and size
alias float_dtype = DType.float32
alias vector_size = 1000
alias layout = Layout.row_major(vector_size)

# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
alias block_size = 256
alias num_blocks = ceildiv(vector_size, block_size)


fn vector_addition(
    lhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    rhs_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
    out_tensor: LayoutTensor[float_dtype, layout, MutableAnyOrigin],
):
    """Calculate the element-wise sum of two vectors on the GPU."""
    # Calculate the index of the vector element for the thread to process
    var tid = block_idx.x * block_dim.x + thread_idx.x

    # Don't process out of bounds elements
    if tid < vector_size:
        out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]


def main():
    @parameter
    if not has_accelerator():
        print("No compatible GPU found")
    else:
        # Get the context for the attached GPU
        ctx = DeviceContext()

        # Create HostBuffers for input vectors
        lhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        rhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )
        ctx.synchronize()

        # Initialize the input vectors
        for i in range(vector_size):
            lhs_host_buffer[i] = Float32(i)
            rhs_host_buffer[i] = Float32(i * 0.5)

        print("LHS buffer: ", lhs_host_buffer)
        print("RHS buffer: ", rhs_host_buffer)

        # Create DeviceBuffers for the input vectors
        lhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)
        rhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](vector_size)

        # Copy the input vectors from the HostBuffers to the DeviceBuffers
        ctx.enqueue_copy(dst_buf=lhs_device_buffer, src_buf=lhs_host_buffer)
        ctx.enqueue_copy(dst_buf=rhs_device_buffer, src_buf=rhs_host_buffer)

        # Create a DeviceBuffer for the result vector
        result_device_buffer = ctx.enqueue_create_buffer[float_dtype](
            vector_size
        )

        # Wrap the DeviceBuffers in LayoutTensors
        lhs_tensor = LayoutTensor[float_dtype, layout](lhs_device_buffer)
        rhs_tensor = LayoutTensor[float_dtype, layout](rhs_device_buffer)
        result_tensor = LayoutTensor[float_dtype, layout](result_device_buffer)

        # Compile and enqueue the kernel
        ctx.enqueue_function[vector_addition](
            lhs_tensor,
            rhs_tensor,
            result_tensor,
            grid_dim=num_blocks,
            block_dim=block_size,
        )

        # Create a HostBuffer for the result vector
        result_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](
            vector_size
        )

        # Copy the result vector from the DeviceBuffer to the HostBuffer
        ctx.enqueue_copy(
            dst_buf=result_host_buffer, src_buf=result_device_buffer
        )

        # Finally, synchronize the DeviceContext to run all enqueued
        # operations
        ctx.synchronize()

        print("Result vector:", result_host_buffer)
```

The `enqueue_function()` method enqueues the compilation and invocation of the `vector_addition()` kernel function, passing the input and output tensors as arguments. The `grid_dim` and `block_dim` arguments use the `num_blocks` and `block_size` aliases we defined in the previous step. After the kernel function has been compiled and enqueued, we create a `HostBuffer` to hold the result vector.
Then we copy the result vector from the `DeviceBuffer` to the `HostBuffer`. Finally, we synchronize the `DeviceContext` to run all enqueued operations. After synchronizing, we can print the result vector to the console. At this point, the Mojo compiler determines that the `DeviceContext`, the `DeviceBuffer`s, the `HostBuffer`s, and the `LayoutTensor`s are no longer used and so it automatically invokes their destructors to free their allocated memory. (For a detailed explanation of object lifetime and destruction in Mojo, see the [Death of a value](/mojo/manual/lifecycle/death) section of the Mojo Manual.) So it's finally time to run the program to see the results of our hard work. ```bash mojo vector_addition.mojo ``` You should see the following output: ```output LHS buffer: HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0]) RHS buffer: HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5]) Result vector: HostBuffer([0.0, 1.5, 3.0, ..., 1495.5, 1497.0, 1498.5]) ``` And now that you're done with the tutorial, exit your project's virtual environment: ```bash exit ``` ## Summary In this tutorial, we've learned how to use Mojo's `gpu.host` package to write a simple kernel function that performs an element-wise addition of two vectors. We covered: - Understanding basic GPU concepts like devices, grids, and thread blocks. - Moving data between CPU and GPU memory. - Writing and compiling a GPU kernel function. - Executing parallel computations on the GPU. ## Next steps Now that you understand the basics of GPU programming with Mojo, here are some suggested next steps: - Check out more [examples](https://github.com/modular/modular/tree/main/examples/gpu_functions) of GPU programming with Mojo in the public [Modular GitHub repository](https://github.com/modular/modular). - Learn more about GPU programming in Mojo and practice your skills by solving the [Mojo GPU puzzles](https://builds.modular.com/puzzles). - Read the [GPU basics](/mojo/manual/gpu/basics) section of the Mojo Manual to find out more about GPU programming in Mojo. - Read the [Introduction to layouts](/mojo/manual/layout/layouts) section of the Mojo Manual to learn more about the `layout` package and managing layouts. - Check out the [Mojo Manual](/mojo/manual) for more information on the Mojo language. - Learn more about other features of the [Modular platform](/max/intro) for building and deploying high-performance AI endpoints. import TutorialStack from '@site/src/components/TutorialStack'; export const maxTutorials = [ 'build-custom-ops', 'magic', ]; export const mojoTutorials = [ 'get-started', ]; --- ## Get started with Magic import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; import SmallCards from '@site/src/components/SmallCards'; import MaxInstall from '@site/src/components/MaxInstall'; Magic is a package manager and virtual environment manager for any language, including Python and Mojo. It builds upon the conda and PyPI packaging ecosystems, which provide access to thousands of packages for Python and other languages, while also adding functionality for MAX and Mojo. The `magic` CLI allows you to instantly launch code examples and create new projects that are fully contained and reproducible across systems. All the package dependencies and environment settings are magically managed for you. This page provides an introduction to basic `magic` commands. For a deep-dive into more features, see the [Magic tutorial](/max/tutorials/magic). 
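As a quick preview, here's what a typical Magic session looks like; each of these commands is covered in detail below:

```sh
magic init my-project --format mojoproject
cd my-project
magic run mojo --version
```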
:::note
Magic is built upon [pixi](https://github.com/prefix-dev/pixi), so you'll see this name appear below.
:::

## Install Magic

You can install Magic on macOS and Ubuntu with this command:

Then run the `source` command that's printed in your terminal.

To see the available commands, print the help:

```sh
magic --help
```

### Enable auto-completion

To enable auto-completion for `magic`, run the command for your shell.

For Bash:

```sh
BASHRC=$( [ -f "$HOME/.bash_profile" ] && echo "$HOME/.bash_profile" || echo "$HOME/.bashrc" )
echo 'eval "$(magic completion --shell bash)"' >> "$BASHRC"
source "$BASHRC"
```

For Zsh:

```sh
echo 'eval "$(magic completion --shell zsh)"' >> ~/.zshrc
source ~/.zshrc
```

For fish:

```sh
echo 'magic completion --shell fish | source' >> ~/.config/fish/config.fish
source ~/.config/fish/config.fish
```

### Update Magic

You can update with the [`self-update`](/magic/commands#magic-self-update) command:

```sh
magic self-update
```

### Uninstall Magic

To remove Magic, delete the binary:

```sh
rm ~/.modular/bin/magic
```

To remove packages installed for your projects, delete the corresponding project directories.

## Create a project

You can create a project with its own package dependencies and virtual environment using the [`magic init`](/magic/commands#magic-init) command. By default, this creates a configuration file called `pixi.toml`, but we recommend that you specify the `--format` option as shown below, to instead create a `pyproject.toml` or `mojoproject.toml` file for enhanced Python and Mojo features, respectively.

:::note MAX build versions
MAX is available as either a stable or nightly build. For more detail about installing MAX versions, instead see [MAX packages](/max/packages).
:::

### Create a Python project

Here's how to create a new Python project and install MAX:

1. Create a Python project with the [`magic init`](/magic/commands#magic-init) command:

   ```sh
   magic init my-project --format pyproject
   ```

   This creates a `my-project` directory and a `pyproject.toml` file that defines the project dependencies and more. (If you omit the directory name, it creates the `pyproject.toml` file in the current directory.)

2. Enter the project directory and use [`magic run`](/magic/commands#magic-run) to execute code inside the virtual environment:

   ```sh
   cd my-project
   ```

   ```sh
   magic run python3 --version
   ```

   Or, activate the environment shell with [`magic shell`](/magic/commands#magic-run):

   ```sh
   magic shell
   ```

   ```sh
   python3 --version
   ```

   Then use `exit` to deactivate the shell:

   ```sh
   exit
   ```

   Always exit the shell before changing projects.

3. If you want a different Python version, open the `pyproject.toml` file and edit the [version specifier](https://packaging.python.org/en/latest/specifications/version-specifiers/#id5) defined with this line:

   ```toml
   requires-python = ">= 3.11"
   ```

4. To install Python packages for your project, use [`magic add`](/magic/commands#magic-add). We recommend you always specify the version, for example:

   ```sh
   magic add "max~=25.3"
   ```

You can run commands such as `magic add` anywhere inside a Magic project directory, whether or not you've activated the shell.

For more information about using Magic for your Python projects, read [using Pixi for Python](https://pixi.sh/latest/tutorials/python/) (just replace each `pixi` command with `magic`).

### Create a Mojo project

Here's how to create a new Mojo project:
1. Create a Mojo project with [`magic init`](/magic/commands#magic-init):

   ```sh
   magic init my-mojo-project --format mojoproject
   ```

   This creates the `my-mojo-project` directory and creates a `mojoproject.toml` file inside, which defines the project dependencies and more. If you omit the path name, Magic creates the config file in the current directory.

   By default, `mojoproject.toml` includes `max` as a dependency because `max` is the package that installs Mojo.

2. Enter the project and use [`magic run`](/magic/commands#magic-run) to execute code inside the environment:

   ```sh
   cd my-mojo-project
   ```

   ```sh
   magic run mojo --version
   ```

   :::note
   By default, `magic` creates each project and adds the latest [nightly release](/max/packages#nightly-release) of MAX/Mojo as a dependency. If you prefer to use a stable release, you can specify the version you want like this:

   ```sh
   magic add "max~=25.3"
   ```
   :::

   You can also activate the environment shell with [`magic shell`](/magic/commands#magic-run):

   ```sh
   magic shell
   ```

   ```sh
   mojo --version
   ```

   Then use `exit` to deactivate the shell before changing projects:

   ```sh
   exit
   ```

3. If you want to use Python with Mojo, specify the Python version and Python packages with [`magic add`](/magic/commands#magic-add). For example:

   ```sh
   magic add "python==3.9"
   ```

   :::caution
   If your Mojo project has Python package dependencies and you create an executable with [`mojo build`](/mojo/cli/build), the executable might not work outside of the Magic environment. That's because the Mojo executable doesn't include the Python packages, so they must be provided in the environment where you run it (such as inside the Magic environment where you built the executable).
   :::

You can run commands such as `magic add` anywhere inside a Magic project directory, whether or not you've activated the shell.

### Convert a conda project to Magic

If you have an existing conda project, you can convert the `environment.yml` configuration file to Magic with this command:

```sh
magic init --import environment.yml
```

:::caution
You might encounter issues if you invoke `magic` within a `conda` virtual environment. It's best if you don't mix the two tools.
:::

## Manage packages

You can add Python and Mojo packages to your project by running [`magic add`](/magic/commands/#magic-add) inside your project directory (every project has its own package versions). For example:

```sh
magic add "max~=25.3" "numpy<2.0" "python>=3.11"
```

If you created a [Mojo project](#create-a-mojo-project), you can modify the Python version like any other package dependency:

```sh
magic add "python==3.10"
```

The next time you run a `magic` command, it updates Python with the appropriate version:

```sh
magic run python3 --version
```

```output
Python 3.10
```

### The `magic.lock` file

Although the project configuration file (`pixi.toml`, `pyproject.toml`, or `mojoproject.toml`) defines your project dependencies, it doesn't define the project's transitive dependencies, which are the dependencies of your project dependencies. Nor does the configuration file always specify the exact package version that is actually installed (such as when you [specify a version](https://packaging.python.org/en/latest/specifications/version-specifiers/#id5) merely with a less-than constraint like `<2.0`). The transitive dependencies and actual installed versions are instead specified in the `magic.lock` file, which is automatically generated—you should not edit this file by hand.
This file is crucial to ensure that you can reliably reproduce your environment across different machines. You can learn more about it from the [Pixi lock file docs](https://pixi.sh/latest/features/lockfile/).

## Known issues

- You might encounter issues if you invoke `magic` within a `conda` or `venv` virtual environment. It's best if you don't mix Magic with other virtual environment tools.
- If you also have `pixi` installed, it generally should work with projects you created using `magic`, but you might see some issues, so we advise that you use only `magic` for MAX and Mojo projects.
- Linux aarch64 (ARM64) does not work with projects using PyTorch 2.2.2.
- The [MAX Replit pipeline](https://github.com/modular/modular/tree/main/examples/graph-api/pipelines/replit) currently doesn't work with the max-conda package.

## More reading

You can learn more about the available commands by printing the help:

```sh
magic -h
```

Or see all the [Magic commands here](/magic/commands). If you have more questions, see the [Magic FAQ](/magic/faq).

And because Magic is built upon `pixi`, you can also learn more from the [pixi documentation](https://pixi.sh/latest/) (just replace each `pixi` command with `magic`). However, there are several differences between `magic` and `pixi`. For example, `magic` does not support the `exec`, `auth`, and `upload` commands, and possibly others in the future.

export const cards = [
  {
    title: 'Get started with MAX',
    description: 'Try one of our tutorials to deploy an LLM using MAX.',
    link: '/max/tutorials',
  },
  {
    title: 'Get started with Mojo',
    description: 'Learn key features of Mojo by building an application from scratch in this hands-on tutorial.',
    link: '/mojo/manual/get-started',
  },
  {
    title: 'A step-by-step guide to Magic',
    description: 'Learn how to get started and get the most out of Magic.',
    link: '/max/tutorials/magic',
  },
];

---

## Get started with MAX Graph in Python

import InstallModular from '@site/docs/_includes/install-modular.mdx';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

MAX Graph is a high-performance computation framework that lets you build and execute efficient machine learning models. It provides a flexible way to define computational workflows as graphs, where each node represents an operation (like matrix multiplication or addition) and edges represent the flow of data. By using MAX Graph, you can create optimized machine learning models that run faster and more efficiently on modern hardware.

In this tutorial, you'll build a graph using the MAX Graph API in Python with an [`ops` function](/max/api/python/graph/ops). To do this, you will complete the following steps:

1. [Build a simple graph that adds two numbers](#build-the-graph)
2. [Create an inference session to load and compile the graph](#create-inference-session)
3. [Execute the graph with input data](#execute-the-graph)

By the end of this tutorial, you'll have an understanding of how to construct basic computational graphs, set up inference sessions, and run computations using the MAX Graph API.

## Set up your environment

Create a Python project to install our APIs and CLI tools. Then, create a working directory.
If you're using `pip`, create a folder called `max_ops`:

```sh
mkdir max_ops
cd max_ops
```

You can check your MAX version like this:

```sh
pip show modular
```

You can check your Python version like this:

```sh
python --version
```

If you're using `uv`, create a folder called `max_ops`:

```sh
mkdir max_ops
cd max_ops
```

You can check your MAX version like this:

```sh
uv pip show modular
```

You can check your Python version like this:

```sh
python --version
```

If you're using `magic`, change folders to your working directory:

```sh
cd src/quickstart
```

You can check your MAX version like this:

```sh
magic run max --version
```

You can check your Python version like this:

```sh
magic run python --version
```

If you have any questions along the way, ask them on [our Discord channel](https://discord.gg/modular).

## 1. Build the graph {#build-the-graph}

Now that our environment and packages are set up, let's create the graph. This graph will define a computational workflow that adds two tensors together.

Let's start by creating a new file called `addition.py` inside your working directory and adding the following imports:

```python
from typing import Any

import numpy as np
from max import engine
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops
```

To create a computational graph, use the [`Graph()`](/max/api/python/graph/Graph) class from the MAX Graph API. When initializing, specify a name for the graph and define the types of inputs it will accept.

```python
def add_tensors(a: np.ndarray, b: np.ndarray) -> dict[str, Any]:
    # 1. Build the graph
    input_type = TensorType(
        dtype=DType.float32, shape=(1,), device=DeviceRef.CPU()
    )
    with Graph(
        "simple_add_graph", input_types=(input_type, input_type)
    ) as graph:
        lhs, rhs = graph.inputs
        out = ops.add(lhs, rhs)
        graph.output(out)
```

Inside the context manager, access the graph's inputs using the [`inputs`](/max/api/python/graph/Graph#max.graph.Graph.inputs) property. This returns a symbolic tensor representing the input arguments. The symbolic tensor is a placeholder that represents the shape and type of data that will flow through the graph during execution, rather than containing actual numeric values as in eager execution.

Then use the [`add()`](/max/api/python/graph/ops#max.graph.ops.add) function from the [`ops`](/max/api/python/graph/ops) package to add the two input tensors. This creates a new symbolic tensor representing the sum. Finally, set the output of the graph using the [`output()`](/max/api/python/graph/Graph#max.graph.Graph.output) method. This specifies which tensors should be returned when the graph is executed.

Now, add a `print()` function to the graph to see what's created.

```python
def add_tensors(a: np.ndarray, b: np.ndarray) -> dict[str, Any]:
    # 1. Build the graph
    # ...

    print("final graph:", graph)
```

The output will show us the structure of our graph, including the input it expects and the operations it will perform. This helps us understand how our graph will process data when we use it. Next, let's load the graph into an inference session.

## 2. Create an inference session {#create-inference-session}

Now that our graph is constructed, let's set up an environment where it can operate. This involves creating an inference session and loading our graph into it. Create an [`InferenceSession()`](/max/api/python/engine#max.engine.InferenceSession) instance that loads and runs the graph inside the `add_tensors()` function.

```python
def add_tensors(a: np.ndarray, b: np.ndarray) -> dict[str, Any]:
    # 1. Build the graph
    # ...

    # 2. Create an inference session
    session = engine.InferenceSession()
    model = session.load(graph)
```

This step transforms our abstract graph into a computational model that's ready for execution. To ensure our model is set up correctly, let's examine its input requirements. Print the graph's input metadata by using the [`input_metadata`](/max/api/python/engine#max.engine.Model.input_metadata) property.

```python
def add_tensors(a: np.ndarray, b: np.ndarray) -> dict[str, Any]:
    # 1. Build the graph
    # ...

    # 2. Create an inference session
    session = engine.InferenceSession()
    model = session.load(graph)

    # highlight-start
    for tensor in model.input_metadata:
        # highlight-end
        print(
            f"name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}"
        )
```

This will output the exact specifications of the input our model expects, helping us prepare appropriate data for processing. Next, let's execute the graph.

## 3. Execute the graph {#execute-the-graph}

To give the model something to add, create two inputs with a shape and a data type that match our graph's input requirements. Then pass the inputs to the [`execute()`](/max/api/python/engine#max.engine.Model.execute) function:

```python
def add_tensors(a: np.ndarray, b: np.ndarray) -> dict[str, Any]:
    # ...

    # 2. Create an inference session
    # ...

    # 3. Execute the graph
    # highlight-start
    ret = model.execute(a, b)[0]
    # highlight-end
    print("result:", ret)
    return ret
```

:::note
Starting in 24.6.0, the `model.execute()` command no longer accepts keyword arguments. In a future release we will restore this functionality with support for GPUs. For compatibility with existing code that uses keyword arguments, you can use the `execute_legacy()` function.
:::

## 4. Run the example

Now that we've built our graph, created an inference session, and defined how to execute the graph, let's put it all together and run our complete example.

At the end of your `addition.py` file, add the following code:

```python
if __name__ == "__main__":
    input0 = np.array([1.0], dtype=np.float32)
    input1 = np.array([1.0], dtype=np.float32)
    add_tensors(input0, input1)
```

This passes your arguments `input0` and `input1` to the `add_tensors()` function.

Then, run the Python file from the command line:

```sh
python addition.py
```

Or, if you're using a Magic project:

```sh
magic run python addition.py
```

You've successfully created your first graph using the MAX Graph API in Python. Let's examine what was printed to the terminal:

```output
final graph: mo.graph @simple_add_graph(%arg0: !mo.tensor<[1], f32>, %arg1: !mo.tensor<[1], f32>) -> !mo.tensor<[1], f32> attributes {argument_names = ["input0", "input1"], result_names = ["output0"]} {
  %0 = rmo.add(%arg0, %arg1) : (!mo.tensor<[1], f32>, !mo.tensor<[1], f32>) -> !mo.tensor<[1], f32>
  mo.output %0 : !mo.tensor<[1], f32>
}
```

This output shows:

- Two input tensors (`%arg0`, `%arg1`) of shape `[1]` and float32 type
- The addition operation connecting them
- One output tensor of matching shape/type

The metadata lines confirm both input tensors match the required specifications.

```output
name: input0, shape: [1], dtype: DType.float32
name: input1, shape: [1], dtype: DType.float32
```

The result shows the addition worked correctly:

$$ [1.0] + [1.0] = [2.0] $$

```output
result: [2.]
```

Now that you've built your first MAX Graph that performs addition, you can explore more complex examples:

- [MAX Graph API example](https://github.com/modular/modular/tree/main/tutorials/max-graph-python)
- [MAX Graph implementation of Llama3](https://github.com/modular/modular/tree/main/max)

## Next steps

import TutorialStack from '@site/src/components/TutorialStack';

export const maxTutorials = [
  'build-custom-ops',
  'magic',
];

---

## Get started with Mojo

import GetMagic from '@site/src/includes/get_magic.mdx';
import Requirements from '@site/src/components/Requirements';
import { requirementsNoGPU } from '@site/docs/max/requirements';

:::tip
Want to write a GPU function with Mojo? See how to [get started with GPU programming with Mojo](/mojo/manual/gpu/intro-tutorial).
:::

Get ready to learn Mojo! This tutorial is designed to give you a tour of several features of Mojo by building a complete program that does much more than simply printing "Hello, world!"

In fact, we'll build a version of [Conway's Game of Life](https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life), which is a simple simulation to explore self-replicating systems. If you haven't heard of it before, don't worry, it will make sense when you see it in action. Let's just get started so you can learn Mojo programming basics, including the following:

- Using basic built-in types like `Int` and `String`
- Using a `List` to manage a sequence of values
- Creating custom types in the form of structs (data structures)
- Creating and importing Mojo modules
- Importing and using Python libraries

This tutorial might be a little long because there's a lot to learn, but we tried to keep the explanations simple, and we included links along the way for you to go learn more about each topic. If you just want to see the finished code, you can [get it on GitHub](https://github.com/modular/modular/tree/main/examples/mojo/life).

## 1. Create a Mojo project with `magic`

We'll start by using the `magic` CLI to create a virtual environment and generate our initial project directory.

1. Install `magic` if you haven't already (see [Get started with Magic](/magic)).

2. Navigate to the directory in which you want to create the project and execute:

   ```bash
   magic init life --format mojoproject
   ```

   This creates a project directory named `life`.

3. Let's go into the directory and list its contents:

   ```bash
   cd life
   ```

   ```bash
   ls -A
   ```

   ```output
   .gitattributes .gitignore .magic magic.lock mojoproject.toml
   ```

You should see that the project directory contains:

- An initial `mojoproject.toml` manifest file, which defines the project dependencies and other features
- A [lock file](/magic#the-magiclock-file) named `magic.lock`, which specifies the transitive dependencies and actual package versions installed in the project's virtual environment

  :::note
  Never edit the lock file directly. The `magic` command automatically updates the lock file if you edit the manifest file.
  :::

- A `.magic` subdirectory containing the conda virtual environment for the project
- Initial `.gitignore` and `.gitattributes` files that you can optionally use if you plan to use `git` version control with the project

Because we used the `--format mojoproject` option when creating the project, `magic` automatically added the `max` package as a dependency, which includes Mojo. Let's verify that our project is configured correctly by checking the version of Mojo that's installed within our project's virtual environment.
`magic run` executes a command in the project's virtual environment, so let's use it to execute `mojo --version`: ```bash magic run mojo --version ``` You should see a version string indicating the version of Mojo installed, which by default should be the latest released version. Great! Now let's write our first Mojo program. ## 2. Create a "Hello, world" program You can use any editor or IDE that you like. If you're using [Visual Studio Code](https://code.visualstudio.com/) you can take advantage of the [Mojo for Visual Studio Code extension](https://marketplace.visualstudio.com/items?itemName=modular-mojotools.vscode-mojo), which provides features like syntax highlighting, code completion, and debugging support. In the project directory, create a file named `life.mojo` containing the following lines of code: ```mojo title="life.mojo" # My first Mojo program! def main(): print("Hello, World!") ``` If you've programmed before in Python, this should look familiar. - We're using the `def` keyword to define a function named `main`. - You can use any number of spaces or tabs for indentation as long as you use the same indentation for the entire code block. We'll follow the [Python style guide](https://peps.python.org/pep-0008/) and use 4 spaces. - This [`print()`](/mojo/stdlib/builtin/io/print) function is a Mojo built-in so it doesn't require an import. An executable Mojo program *requires* you to define a no-argument `main()` as its entry point. Running the program automatically invokes the `main()` function, and your program exits when the `main()` function returns. To run the program, we first need to start a shell session in our project's virtual environment: ```bash magic shell ``` Later on, when you want to exit the virtual environment, just type `exit`. Now we can use the `mojo` command to run our program. ```bash mojo life.mojo ``` ```output Hello, World! ``` Mojo is a compiled language, not an interpreted one like Python. So when we run our program like this, `mojo` performs [just-in-time compilation](https://en.wikipedia.org/wiki/Just-in-time_compilation) (JIT) and then runs the result. We can also compile our program into an executable file using [`mojo build`](/mojo/cli/build) like this: ```bash mojo build life.mojo ``` By default, this saves an executable file to the current directory named `life`. ```bash ./life ``` ```output Hello, World! ``` ## 3. Create and use variables Let's extend this basic program by prompting the user for their name and including that in the greeting printed. The built-in [`input()`](/mojo/stdlib/builtin/io/input) function accepts an optional [`String`](/mojo/stdlib/collections/string/string/String) argument to use as a prompt, and returns a `String` consisting of the characters the user entered (with the newline character at the end stripped off). So let's declare a variable, assign the return value from `input()` to it, and build a customized greeting. ```mojo title="life.mojo" def main(): var name: String = input("Who are you? ") var greeting: String = "Hi, " + name + "!" print(greeting) ``` Go ahead and run it: ```bash mojo life.mojo ``` ```output Who are you? Edna Hi, Edna! ``` Notice that this code uses a `String` type annotation indicating the type of value that the variable can contain. The Mojo compiler performs [static type checking](https://en.wikipedia.org/wiki/Type_system#Static_type_checking), which means that you'll encounter a compile-time error if your code tries to assign a value of one type to a variable of a different type. 
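For example, here's a minimal sketch of the kind of mismatch the compiler rejects (the exact error message may vary by compiler version):

```mojo
def main():
    var count: Int = 42
    # Compile-time error: a String value can't be assigned to an
    # Int variable, so this line won't compile.
    count = "forty-two"
```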
Mojo also supports implicitly declared variables, where you simply assign a value to a new variable without using the `var` keyword or indicating its type. So we can replace the code we just entered with the following, and it works exactly the same.

```mojo title="life.mojo"
def main():
    name = input("Who are you? ")
    greeting = "Hi, " + name + "!"
    print(greeting)
```

However, implicitly declared variables still have a fixed type, which Mojo automatically infers from the initial value assignment. So in this example both `name` and `greeting` are inferred as `String` type variables. If you then try to assign an integer value like 42 to the `name` variable, you'll get a compile-time error because of the type mismatch. You can learn more about Mojo variables in the [Variables](/mojo/manual/variables) section of the Mojo manual.

## 4. Use Mojo `Int` and `List` types to represent the game state

As originally envisioned by John Conway, the game's "world" is an infinite, two-dimensional grid of square cells, but for our implementation we'll constrain the grid to a finite size. A drawback to making the edges of the grid a hard boundary is that there are fewer neighboring cells around the edges compared to the interior, which tends to cause die-offs. Therefore, we'll model the world as a toroid (a donut shape), where the top row is considered adjacent to the bottom row, and the left column is considered adjacent to the right column. This will come into play later when we implement the algorithm for calculating each subsequent generation.

To keep track of the height and width of our grid we'll use [`Int`](/mojo/stdlib/builtin/int/Int), which represents a signed integer of the [word size](https://en.wikipedia.org/wiki/Word_(computer_architecture)) of the CPU, typically 32 or 64 bits.

We'll represent the state of an individual cell with an `Int` value of 1 (populated) or 0 (unpopulated). Later, when we need to determine the number of populated neighbors surrounding a cell, we can simply add the values of the neighboring cells.

To represent the state of the entire grid, we need a [collection type](/mojo/manual/types#collection-types). The most appropriate for this use case is [`List`](/mojo/stdlib/collections/list/List), which is a dynamically-sized sequence of values. All of the values in a Mojo `List` must be the same type so that the Mojo compiler can ensure type safety. (For example, when we retrieve a value from a `List[Int]`, the compiler knows that the value is an `Int` and can verify that we then use it correctly.)

Mojo collections are implemented as [generic types](https://en.wikipedia.org/wiki/Generic_programming), so that we can indicate the type of values the specific collection will hold by specifying a [type parameter](/mojo/manual/parameters/#parameterized-structs) in square brackets like this:

```mojo
# The List in row can contain only Int values
row = List[Int]()

# The List in names can contain only String values
names = List[String]()
```

We can also create a `List` with an initial set of values and let the compiler infer the type.

```mojo
nums = List(12, -7, 64)  # A List[Int] containing 3 Int values
```

The Mojo `List` type includes the ability to append to the list, pop values out of the list, and access list items using subscript notation.
Here's a taste of those operations: ```mojo nums = List(12, -7, 64) nums.append(-937) print("Number of elements in the list:", len(nums)) print("Popping last element off the list:", nums.pop()) print("First element of the list:", nums[0]) print("Second element of the list:", nums[1]) print("Last element of the list:", nums[-1]) ``` ```output Number of elements in the list: 4 Popping last element off the list: -937 First element of the list: 12 Second element of the list: -7 Last element of the list: 64 ``` We can also nest `List`s: ```mojo grid = List( List(11, 22), List(33, 44) ) print("Row 0, Column 0:", grid[0][0]) print("Row 0, Column 1:", grid[0][1]) print("Row 1, Column 0:", grid[1][0]) print("Row 1, Column 1:", grid[1][1]) ``` ```output Row 0, Column 0: 11 Row 0, Column 1: 22 Row 1, Column 0: 33 Row 1, Column 1: 44 ``` This looks like a good way to represent the state of the grid for our program. So let's update the `main()` function with the following code that defines an 8x8 grid containing the initial state of a "[glider](https://en.wikipedia.org/wiki/Glider_(Conway%27s_Game_of_Life))" pattern. ```mojo title="life.mojo" def main(): num_rows = 8 num_cols = 8 glider = List( List(0, 1, 0, 0, 0, 0, 0, 0), List(0, 0, 1, 0, 0, 0, 0, 0), List(1, 1, 1, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), ) ``` ## 5. Create and use a function to print the grid Now let's create a function to generate a string representation of the game grid that we can print to the terminal. There are actually two different keywords that we can use to define functions in Mojo: `def` and `fn`. Using `fn` gives us finer level control over the function definition, whereas `def` provides a good set of default behaviors for most use cases. Let's add the following definition of a function named `grid_str()` to our program. The Mojo compiler doesn't care whether we add our function before or after `main()`, but the convention is to put `main()` at the end. ```mojo title="life.mojo" def grid_str(rows: Int, cols: Int, grid: List[List[Int]]) -> String: # Create an empty String str = String() # Iterate through rows 0 through rows-1 for row in range(rows): # Iterate through columns 0 through cols-1 for col in range(cols): if grid[row][col] == 1: str += "*" # If cell is populated, append an asterisk else: str += " " # If cell is not populated, append a space if row != rows-1: str += "\n" # Add a newline between rows, but not at the end return str ``` When we pass a value to a Mojo function, the default behavior for `def` is that an argument is treated as a read-only reference to the value. However, if the Mojo compiler determines that there is code in the function that can change the value, then the argument gets a copy of the original value assigned to it. As we'll see later, we can specify a different behavior by including an explicit [argument convention](/mojo/manual/values/ownership#argument-conventions). In contrast, when you define a function with `fn` Mojo simply treats each argument as a read-only reference by default unless you provide an explicit argument convention. Each argument name is followed by a type annotation indicating the type of value you can pass to the argument. Just like when you're assigning a value to a variable, you'll encounter a compile-time error if your code tries to pass a value of one type to an argument of a different type. 
Finally, the `-> String` following the argument list indicates that this function has a `String` type return value. In the body of the function, we generate a `String` by appending an asterisk for each populated cell and a space for each unpopulated cell, separating each row of the grid with a newline character. We use nested `for` loops to iterate through each row and column of the grid, using [`range()`](/mojo/stdlib/builtin/range/range) to generate a sequence of integers from 0 up to but not including the given end value. Then we append the correct characters to the `String` representation. See [Control flow](/mojo/manual/control-flow) for more information on `if`, `for`, and other control flow structures in Mojo. :::note As described in [The `for` statement](/mojo/manual/control-flow#the-for-statement) section of the Mojo manual, it's possible to iterate over the elements of a `List` directly instead of iterating over the values of a `range()` and then accessing the `List` elements by their numeric index. However, iterating over a `List` directly currently returns a *reference* to the element, which then requires using the dereference operator, `[]`, to access the actual element value. The code looks like this: ```mojo nums = List(12, -7, 64) for value in nums: print("Value:", value[]) ``` This behavior is likely to change in the future, at which point iterating over a `List` won't require using the dereference operator. But for this tutorial, we'll stick with iterating over a `range()` and accessing the `List` elements by their numeric index. ::: Now that we've defined our `grid_str()` function, let's invoke it from `main()`. ```mojo title="life.mojo" def main(): ... result = grid_str(num_rows, num_cols, glider) print(result) ``` Then run the program to see the result: ```bash mojo life.mojo ``` ```output * * *** ``` We can see that the position of the asterisks matches the location of the 1s in the `glider` grid. ## 6. Create a module and define a custom type We're currently passing three arguments to `grid_str()` to describe the size and state of the grid to print. A better approach would be to define our own custom type that encapsulates all information about the grid. Then any function that needs to manipulate a grid can accept just a single argument. We can do this by defining a Mojo *struct*, which is a custom data structure. A [Mojo struct](/mojo/manual/structs) is a custom type consisting of: - Fields, which are variables containing the data associated with the structure - Methods, which are functions that we can optionally define to manipulate instances of the struct that we create :::note Mojo structs are similar to classes. However, Mojo structs do *not* support inheritance. Mojo doesn't support classes at this time. ::: We could define the struct in our existing `life.mojo` source file, but let's create a separate *module* for the struct. A module is simply a Mojo source file containing struct and function definitions that can be imported into other Mojo source files. To learn more about creating and importing modules, see the [Modules and packages](/mojo/manual/packages) section of the Mojo manual . So create a new source file named `gridv1.mojo` in the project directory containing the following definition of a struct named `Grid` consisting of three fields: ```mojo title="gridv1.mojo" @value struct Grid(): var rows: Int var cols: Int var data: List[List[Int]] ``` Mojo requires you to declare all of the fields in the struct definition. 
You can't add fields dynamically at run-time. You must declare the type for each field, but you cannot assign a value as part of the field declaration. Instead, the [constructor](/mojo/manual/lifecycle/life#constructor) is responsible for initializing the value of all fields. Mojo structs support several different [lifecycle methods](/mojo/manual/lifecycle/) defining the behavior when an instance of the struct is created, moved, copied, and destroyed. For structs that are basic aggregations of other types and don't require custom resource management or lifecycle behaviors, you can simply add the [`@value`](/mojo/manual/structs#value-decorator) decorator to your struct definition to have the Mojo compiler automatically generate lifecycle methods for you. Because we used the `@value` decorator, `Grid` includes a "member-wise" [constructor](/mojo/manual/lifecycle/life#constructor) . The constructor's arguments are the same names and types as the struct's fields and appear in the same order. So this means that we can create an instance of `Grid` like this: ```mojo my_grid = Grid(2, 2, List(List(0, 1), List(1, 1))) ``` We can then access the field values with "dot" syntax like this: ```mojo print("Rows:", my_grid.rows) ``` ```output Rows: 2 ``` ## 7. Import a module and use our custom `Grid` type Now let's edit `life.mojo` to import `Grid` from our new module and update our code to use it. ```mojo title="life.mojo" from gridv1 import Grid def grid_str(grid: Grid) -> String: # Create an empty String str = String() # Iterate through rows 0 through rows-1 for row in range(grid.rows): # Iterate through columns 0 through cols-1 for col in range(grid.cols): if grid.data[row][col] == 1: str += "*" # If cell is populated, append an asterisk else: str += " " # If cell is not populated, append a space if row != grid.rows - 1: str += "\n" # Add a newline between rows, but not at the end return str def main(): glider = List( List(0, 1, 0, 0, 0, 0, 0, 0), List(0, 0, 1, 0, 0, 0, 0, 0), List(1, 1, 1, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), List(0, 0, 0, 0, 0, 0, 0, 0), ) start = Grid(8, 8, glider) result = grid_str(start) print(result) ``` At this point we've made several changes to improve the structure of our program, but the output should remain the same. ```bash mojo life.mojo ``` ```output * * *** ``` ## 8. Implement `grid_str()` as a method Our `grid_str()` function is really a utility function unique to the `Grid` type. So rather than defining it as a standalone function, it makes more sense to define it as part of the `Grid` type as a method. To do so, move the function into `gridv1.mojo` and edit it to look like this (or simply copy the code below into `gridv1.mojo`): ```mojo title="gridv1.mojo" @value struct Grid(): var rows: Int var cols: Int var data: List[List[Int]] def grid_str(self) -> String: # Create an empty String str = String() # Iterate through rows 0 through rows-1 for row in range(self.rows): # Iterate through columns 0 through cols-1 for col in range(self.cols): if self.data[row][col] == 1: str += "*" # If cell is populated, append an asterisk else: str += " " # If cell is not populated, append a space if row != self.rows - 1: str += "\n" # Add a newline between rows, but not at the end return str ``` So aside from moving the code from one source file to another, there are a few other changes that we made. 
- The function definition is indented to indicate that it's a method defined by the `Grid` struct. This also changes the way that we invoke the function. Instead of `grid_str(my_grid)` we now write `my_grid.grid_str()`. - We've changed the argument name to `self`. When you invoke an instance method, Mojo automatically passes the instance as the first argument, followed by any explicit arguments that you provide. Although we could use any name we like for this argument, the convention is to call it `self`. - We've deleted the argument's type annotation. The compiler knows that the first argument of the method is an instance of the struct, so it doesn't require an explicit type annotation. Now that we've refactored the function into an instance method, we also need to update the code in `life.mojo` where we invoke it from `main()`: ```mojo title="life.mojo" def main(): ... start = Grid(8, 8, glider) print(start.grid_str()) ``` Once again, our refactoring has improved the structure of our code, but it still produces the same output. You can verify that by running the program again. ## 9. Implement support for the `StringableRaising` trait You can convert most Mojo types to `String` using `String(my_val)` to produce a `String` representation of that instance. But you'll get an error if you try to do that with our current implementation of `Grid`. So let's fix that. Because the Mojo compiler performs static type checking, a `String` constructor can accept a value only if its type implements some required behavior—in this case, it only accepts types that can generate a `String` representation. To enable that, Mojo supports [*traits*](/mojo/manual/traits). A trait is a set of requirements in the form of one or more method signatures. A type can *conform* to that trait by implementing all of the method signatures declared in the trait. Then we can have a function that indicates that it accepts values of any type that conform to a specified trait. (This type of function is sometimes referred to as a [*generic* function](/mojo/manual/parameters/#parameters-and-generics).) In the case of `String()`, it requires a type to conform to either the `Stringable` or `StringableRaising` trait. Each trait requires a conforming type to implement a `__str__()` method that returns a `String` representation. The only difference between the two traits is that `Stringable` requires that the method *cannot* raise an error, whereas `StringableRaising` indicates that the method *might* raise an error. (To learn more, read [The `Stringable`, `Representable`, and `Writable` traits](/mojo/manual/traits#the-stringable-representable-and-writable-traits).) Our `grid_str()` method already returns a `String` representation, so it looks like we just have to rename it to `__str__()`. But we also need to indicate which trait `Grid` conforms to. In our case, it's `StringableRaising` because we used `def` to define the method. If you define a function or method with `def`, the compiler *always* assumes that the function *can* raise an error. In contrast, if you define a function or method with `fn` you must explicitly indicate with a `raises` keyword if it can raise an error. So in `gridv1.mojo` we need to update the `Grid` declaration to indicate that the type conforms to `StringableRaising` and rename the `grid_str()` method to `__str__()`: ```mojo title="gridv1.mojo" @value struct Grid(StringableRaising): ... def __str__(self) -> String: ... ``` Now let's verify that `String()` works with an instance of `Grid`. 
```mojo title="life.mojo" def main(): ... start = Grid(8, 8, glider) print(String(start)) ``` If you run the program again, you should still see the same glider pattern as before. ```bash mojo life.mojo ``` ```output * * *** ``` ## 10. Implement methods to support indexing Looking at the implementation of `__str__()` you'll notice that we use `self.data[row][col]` to retrieve the value of a cell in the grid. And if `my_grid` is an instance of `Grid`, we would use `my_grid.data[row][col]` to refer to a cell in the grid. This breaks a fundamental principle of encapsulation in that we need to know that `Grid` stores the game state in a field called `data`, and that field is a `List[List[Int]]`. If we later decide to change the internal implementation of `Grid`, then there could be a lot of code that would need to be changed. A cleaner approach is to provide "getter" and "setter" methods to access cell values. We could simply define methods like `get_cell()` and `set_cell()`, but this is a good opportunity to show how we can define the behavior of built-in operators for custom Mojo types. Specifically, we'll implement support for indexing, so that we can refer to a cell with syntax like `my_grid[row, col]`. This will be useful when we implement support for evolving the state of the grid. As described in [Operators, expressions, and dunder methods](/mojo/manual/operators), Mojo allows us to define the behavior of many of the built-in operators for a custom type by implementing special *dunder* (double underscore) methods. In the case of indexing, the two methods are `__getitem__()` and `__setitem__()`. So let's add the following methods to the `Grid` struct in `gridv1.mojo`: ```mojo title="gridv1.mojo" @value struct Grid(StringableRaising): ... def __getitem__(self, row: Int, col: Int) -> Int: return self.data[row][col] def __setitem__(mut self, row: Int, col: Int, value: Int) -> None: self.data[row][col] = value ``` The implementation of `__getitem__()` is easy. For the given values of `row` and `col` we just need to retrieve and return the corresponding value from the nested `List[List[Int]]` stored in the `data` field of the instance. The body of `__setitem__()` is similarly straightforward. We just take the given `value` and store it in the corresponding `row` and `col` in `data`. One thing new in the declaration is that we set the return type to `None` to indicate that the method doesn't have a return value. But more notable is that we've added the `mut` [argument convention](/mojo/manual/values/ownership#argument-conventions) to the `self` argument to explicitly tell the Mojo compiler that we want to mutate the state of the current instance. If we were to omit `mut`, we would get an error because the compiler would default to read-only access for the argument. Now that we've implemented these methods, we can update `__str__()` to use indexing syntax to access the cell value. ```mojo title="gridv1.mojo" @value struct Grid(StringableRaising): ... def __str__(self) -> String: ... # Iterate through columns 0 through cols-1 for col in range(self.cols): if self[row, col] == 1: ... 
``` Click here to see the complete `gridv1.mojo` so far: ```mojo title="gridv1.mojo" @value struct Grid(StringableRaising): var rows: Int var cols: Int var data: List[List[Int]] def __str__(self) -> String: # Create an empty String str = String() # Iterate through rows 0 through rows-1 for row in range(self.rows): # Iterate through columns 0 through cols-1 for col in range(self.cols): if self[row, col] == 1: str += "*" # If cell is populated, append an asterisk else: str += " " # If cell is not populated, append a space if row != self.rows - 1: str += "\n" # Add a newline between rows, but not at the end return str def __getitem__(self, row: Int, col: Int) -> Int: return self.data[row][col] def __setitem__(mut self, row: Int, col: Int, value: Int) -> None: self.data[row][col] = value ``` Our refactoring hasn't changed our program's behavior, but it's still a good idea to run it to be sure that we don't have any errors in our code. ## 11. Define a static method to generate random grids So far, we've used the glider to build the basic functionality of our `Grid` type. But what's much more interesting is to start with a grid in a random state and see how it evolves over time. Let's add a *static method* named `random()` to the `Grid` struct to generate and return an instance of `Grid` with a random state. A static method doesn't operate on specific instances of the type, so it can be invoked as a utility function. We indicate that a method is a static method by using the `@staticmethod` decorator. ```mojo title="gridv1.mojo" import random @value struct Grid(StringableRaising): ... @staticmethod def random(rows: Int, cols: Int) -> Self: # Seed the random number generator using the current time. random.seed() data = List[List[Int]]() for row in range(rows): row_data = List[Int]() for col in range(cols): # Generate a random 0 or 1 and append it to the row. row_data.append(Int(random.random_si64(0, 1))) data.append(row_data) return Self(rows, cols, data) ``` At the top of the file we're importing the `random` package from the Mojo standard library. It includes several functions related to random number generation. By default, the [pseudorandom number generator](https://en.wikipedia.org/wiki/Pseudorandom_number_generator) used by the Mojo standard library currently uses a fixed seed. This means that it generates the same sequence of numbers unless you provide a different seed, which is useful for testing purposes. But for this application we want to call `random.seed()` to set a seed value based on the current time, which gives us a unique value every time. Then we create an empty `List[List[Int]]` that we populate with a random initial state. For each cell, we call [`random.random_si64()`](/mojo/stdlib/random/random/random_si64), which returns a random integer value from the provided minimum and maximum values of 0 and 1, respectively. This function actually returns a value of type `Int64`, which is a signed 64-bit integer value. As described in [Numeric types](/mojo/manual/types#numeric-types), this is *not* the same as the `Int` type whose precision is dependent on the native word size of the system. Therefore we're passing this value to the [`Int()`](/mojo/stdlib/builtin/int/Int/#__init__) constructor, which explicitly converts a numeric value to an `Int`. The return type of the method is `Self`, which is an alias for the type of the struct. This is a convenient shortcut if the actual name of the struct is long or includes parameters. 
The last line uses `Self()` to invoke the struct's constructor and return a newly created instance with random data.

Now we can update the `main()` function in `life.mojo` to create a random `Grid` and print it.

```mojo title="life.mojo"
...
def main():
    start = Grid.random(8, 16)
    print(String(start))
```

Run the program a few times to verify that it generates a different grid each time.

```bash
mojo life.mojo
```

```output
*** *  ****   *
****  ******  *
 * ***** * * **
** * * ** ****
* ** *  * * * *
* *  ** ** ** *
*****  **   * *
 ** *   * *** *
```

## 12. Implement a method to evolve the grid

It's finally time to let our world evolve. We'll implement an `evolve()` method to calculate the state of the grid for the next generation. One option would be to do an in-place modification of the existing `Grid` instance. But instead we'll have `evolve()` return a new instance of `Grid` for the next generation.

```mojo title="gridv1.mojo"
...
struct Grid(StringableRaising):
    ...
    def evolve(self) -> Self:
        next_generation = List[List[Int]]()

        for row in range(self.rows):
            row_data = List[Int]()

            # Calculate neighboring row indices, handling "wrap-around"
            row_above = (row - 1) % self.rows
            row_below = (row + 1) % self.rows

            for col in range(self.cols):
                # Calculate neighboring column indices, handling "wrap-around"
                col_left = (col - 1) % self.cols
                col_right = (col + 1) % self.cols

                # Determine number of populated cells around the current cell
                num_neighbors = (
                    self[row_above, col_left]
                    + self[row_above, col]
                    + self[row_above, col_right]
                    + self[row, col_left]
                    + self[row, col_right]
                    + self[row_below, col_left]
                    + self[row_below, col]
                    + self[row_below, col_right]
                )

                # Determine the state of the current cell for the next generation
                new_state = 0
                if self[row, col] == 1 and (num_neighbors == 2 or num_neighbors == 3):
                    new_state = 1
                elif self[row, col] == 0 and num_neighbors == 3:
                    new_state = 1
                row_data.append(new_state)

            next_generation.append(row_data)

        return Self(self.rows, self.cols, next_generation)
```

We start out with an empty `List[List[Int]]` to represent the state of the next generation. Then we use nested `for` loops to iterate over each row and each column of the existing `Grid` to determine the state of each cell in the next generation.

For each cell in the grid, we need to count the number of populated neighboring cells. Because we're modeling the world as a toroid, we need to consider the top and bottom rows as adjacent, and the left-most and right-most columns as adjacent. So as we iterate through each row and column, we use the modulo operator, `%`, to handle "wrap-around" when we calculate the indices of the rows above and below, and the columns to the left and right of the current cell. (For example, if there are 8 rows, then `-1 % 8` is 7.)

Then we apply the Game of Life rules that determine whether the current cell is populated (1) or unpopulated (0) in the next generation:

- A populated cell with either 2 or 3 populated neighbors remains populated in the next generation
- An unpopulated cell with exactly 3 populated neighbors becomes populated in the next generation
- All other cells become unpopulated in the next generation

After calculating the state of the next generation, we use `Self()` to create a new instance of `Grid` and return it.

Now that we can evolve the grid, let's try it out.
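Before we wire `evolve()` into the main loop, a quick sanity check can help. A three-cell "blinker" is the classic test case: it oscillates between a vertical bar and a horizontal bar on every generation. Here's a minimal sketch (a hypothetical alternate `main()` for `life.mojo`, assuming the `Grid` type built in this tutorial; the 5x5 size keeps the wrap-around edges from interfering):

```mojo
from gridv1 import Grid

def main():
    # Build a vertical, three-cell blinker centered on a 5x5 grid.
    data = List[List[Int]]()
    for row in range(5):
        row_data = List[Int]()
        for col in range(5):
            if col == 2 and row >= 1 and row <= 3:
                row_data.append(1)
            else:
                row_data.append(0)
        data.append(row_data)

    blinker = Grid(5, 5, data)
    print(String(blinker))  # Vertical bar in the middle column
    print()
    print(String(blinker.evolve()))  # Should flip to a horizontal bar
```

If the second grid shows a horizontal bar in the middle row, the neighbor counting and wrap-around logic are behaving as expected.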
Now let's use `evolve()` in `life.mojo`. We'll add a `run_display()` function to control the game's main loop:

- Display the current `Grid`
- Prompt the user to continue or quit
- Break out of the loop if the user enters `q`
- Otherwise, calculate the next generation and loop again

Then we'll update `main()` to create a random initial `Grid` and pass it to `run_display()`. Here is the updated version of `life.mojo`:

```mojo title="life.mojo"
from gridv1 import Grid

def run_display(owned grid: Grid) -> None:
    while True:
        print(String(grid))
        print()
        if input("Enter 'q' to quit or press <Enter> to continue: ") == "q":
            break
        grid = grid.evolve()

def main():
    start = Grid.random(16, 16)
    run_display(start)
```

Run the program and verify that each call to `evolve()` successfully produces a new generation.

So now we have a working version of the Game of Life, but the terminal interface is not very pretty. Let's spice things up with a nicer graphical user interface, using a Python library.

## 13. Import and use a Python package

Mojo lets you import Python modules, call Python functions, and interact with Python objects from Mojo code. To demonstrate this capability, we're going to use a Python package called [pygame](https://www.pygame.org) to create and manage a graphical user interface for our Game of Life program.

First, we need to update our `mojoproject.toml` file to add a dependency on Python and the `pygame` package. So in the project directory, execute the following command from the terminal:

```bash
magic add "python>=3.11,<3.13" "pygame>=2.6.1,<3"
```

Now we can update `life.mojo` to import and use `pygame`. Here is the new version of `run_display()` and `main()`:

```mojo title="life.mojo"
import time

from gridv1 import Grid
from python import Python

def run_display(
    owned grid: Grid,
    window_height: Int = 600,
    window_width: Int = 600,
    background_color: String = "black",
    cell_color: String = "green",
    pause: Float64 = 0.1,
) -> None:
    # Import the pygame Python package
    pygame = Python.import_module("pygame")

    # Initialize pygame modules
    pygame.init()

    # Create a window and set its title
    window = pygame.display.set_mode(
        Python.tuple(window_width, window_height)
    )
    pygame.display.set_caption("Conway's Game of Life")

    cell_height = window_height / grid.rows
    cell_width = window_width / grid.cols
    border_size = 1
    cell_fill_color = pygame.Color(cell_color)
    background_fill_color = pygame.Color(background_color)

    running = True
    while running:
        # Poll for events
        event = pygame.event.poll()
        if event.type == pygame.QUIT:
            # Quit if the window is closed
            running = False
        elif event.type == pygame.KEYDOWN:
            # Also quit if the user presses <Esc> or 'q'
            if Bool(event.key == pygame.K_ESCAPE) or Bool(
                event.key == pygame.K_q
            ):
                running = False

        # Clear the window by painting with the background color
        window.fill(background_fill_color)

        # Draw each live cell in the grid
        for row in range(grid.rows):
            for col in range(grid.cols):
                if grid[row, col]:
                    x = col * cell_width + border_size
                    y = row * cell_height + border_size
                    width = cell_width - border_size
                    height = cell_height - border_size
                    pygame.draw.rect(
                        window,
                        cell_fill_color,
                        Python.tuple(x, y, width, height),
                    )

        # Update the display
        pygame.display.flip()

        # Pause to let the user appreciate the scene
        time.sleep(pause)

        # Next generation
        grid = grid.evolve()

    # Shut down pygame cleanly
    pygame.quit()

def main():
    start = Grid.random(128, 128)
    run_display(start)
```

Each argument for `run_display()` other than `grid` has a default value associated with it (for example, the default `window_height` is 600 pixels). If you don't explicitly pass a value for an argument when you invoke `run_display()`, Mojo uses the default value specified in the function definition.

After importing the `pygame` module, we call `pygame.init()` to initialize all the pygame subsystems.

The `set_mode()` function creates and initializes a window, with the width and height passed as a Python tuple of two values.
This returns a [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) wrapper for the window, which we can then use to call functions and set attributes to manipulate the window. (For more information about interacting with Python objects from Mojo, see [Python types](/mojo/manual/python/types).)

The bulk of the `run_display()` function is a loop that uses `pygame` to poll for events like key presses and mouse clicks. If it detects that the user presses the `q` or `<Esc>` key or closes the display window, it exits the loop and shuts down pygame with `pygame.quit()`. Otherwise, it clears the window and then iterates through all cells in the grid to display the populated cells. After sleeping for `pause` seconds, it evolves the grid to the next generation and loops again.

So it's finally time to try it out.

```bash
mojo life.mojo
```

Now when you run the program you should see a new window appear on screen displaying your evolving grid. We now have a fully functional implementation of the Game of Life with a nice interface. We've come quite a way from just displaying a few asterisks on the terminal!

![game_of_life_screen.png](images/game-of-life-screen.png)

To quit the program, press the `q` or `<Esc>` key, or close the window.

And now that we're done with the tutorial, exit the project's virtual environment:

```bash
exit
```

## Summary

Congratulations on writing a complete Mojo application from scratch! Along the way, you got a taste of:

- Using Magic to create, build, and run a Mojo program
- Using Mojo built-in types like `Int`, `String`, and `List`
- Creating and using variables and functions
- Using control structures like `if`, `while`, and `for`
- Defining and using a custom Mojo struct
- Creating and importing a Mojo module
- Using modules from the Mojo standard library
- Importing and using a Python module

## Next steps

Now that you've seen a bit of what Mojo can do, here are some suggested next steps:

- Read through the [Mojo manual](/mojo/manual/) for more detail about all of Mojo's features.
- Check out [Get started with GPU programming with Mojo and the MAX Driver](/mojo/manual/gpu/intro-tutorial) for an example of how to write GPU functions with Mojo.
- Explore more Mojo [examples](https://github.com/modular/modular/tree/main/examples/mojo) in the public [MAX GitHub repository](https://github.com/modular/modular).

---

## get_accum_type

`get_accum_type[dtype: DType, *, preferred_accum_type: DType = float32]() -> DType`

Returns the recommended dtype for accumulation operations.

Half-precision and float8 types can introduce numerical error if they are used in reduction/accumulation operations. This function returns a higher-precision dtype to use for accumulation if a half-precision type is provided; otherwise it returns the original dtype.

The rules are as follows:

- If the dtype is a float8 type, return a float16 type.
- If the dtype is a bfloat16 type, return a float32 type.
- If the dtype is a float16 type, return a float32 dtype if the `preferred_accum_type` is float32; otherwise return a float16 type.
- Otherwise, return the original dtype.

**Parameters:**

* dtype (`DType`): The dtype of some accumulation operation.
* preferred_accum_type (`DType`): The preferred dtype for accumulation.

**Returns:**

The recommended dtype for accumulation operations based on the input dtype and the preferred accumulation type.
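To make the rules concrete, here's a small sketch of the results you'd expect (the import path is an assumption; `get_accum_type` may be exposed from a different module in your build):

```mojo
# Assumed import path; adjust to wherever get_accum_type lives in your build.
from utils.numerics import get_accum_type

fn main():
    # bfloat16 accumulates in float32.
    print(get_accum_type[DType.bfloat16]())  # float32
    # float16 widens to float32 by default...
    print(get_accum_type[DType.float16]())  # float32
    # ...but stays float16 when that's the preferred accumulation type.
    print(get_accum_type[DType.float16, preferred_accum_type=DType.float16]())  # float16
    # Full-precision types pass through unchanged.
    print(get_accum_type[DType.float32]())  # float32
```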
---

## get_cblas_f32_function

`get_cblas_f32_function() -> fn(_CBLASOrder, _CBLASTranspose, _CBLASTranspose, SIMD[int32, 1], SIMD[int32, 1], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1], SIMD[float32, 1], UnsafePointer[SIMD[float32, 1]], SIMD[int32, 1]) -> None`

---

## get_config_from_shape

`get_config_from_shape[a_type: DType, b_type: DType, c_type: DType, static_N: Int, static_K: Int, transpose_b: Bool = False, target: StringSlice[StaticConstantOrigin] = _accelerator_arch()](dyn_M: Int, ctx: DeviceContext) -> MatmulConfig[a_type, b_type, c_type, transpose_b]`

---

## get_conv_num_partitions

`get_conv_num_partitions[micro_kernel_w: Int, micro_kernel_f: Int](num_threads: Int, conv_shape: ConvShape[rank]) -> IndexList[4]`

Partitions the workload in (batch, C, F, HOWO) dimensions, where HOWO is the combination of the HO and WO dimensions. The actual number of tasks is the product of the returned `num_partitions`.

---

## get_conv_num_tasks

`get_conv_num_tasks(num_threads: Int, conv_shape: ConvShape[rank]) -> Int`

---

## get_conv_shape

`get_conv_shape[rank: Int, filter_packed: Bool](output: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], stride: IndexList[rank], dilation: IndexList[rank], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], num_groups: Int) -> ConvShape[rank]`

---

## get_conv_tile_shape

`get_conv_tile_shape[type: DType](c: Int, filter_window_size: Int, micro_kernel_width: Int) -> IndexList[2]`

Computes the (c, f) tile shape in L2. Assuming NHWC layout, the tile shape is (R, S, c_tile, f_tile). R and S are by default fully covered. The heuristic tries to block in C as much as possible; if C is small, it starts to block F.

---

## get_conv_tile_size

`get_conv_tile_size[type: DType]() -> Int`

---

## get_conv2d_shape

`get_conv2d_shape[output_shape: DimList, input_shape: DimList, filter_shape: DimList, type: DType, data_layout: Image2DLayout, filter_layout: Image2DLayout](output: NDBuffer[type, 4, origin, output_shape], input: NDBuffer[type, 4, origin, input_shape], filter: NDBuffer[type, 4, origin, filter_shape], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[2], dilation: IndexList[2], num_groups: Int) -> ConvShape[2]`

`get_conv2d_shape[filter_rank: Int, output_shape: DimList, input_shape: DimList, filter_shape: DimList, type: DType, data_layout: Image2DLayout, filter_layout: Image2DLayout](output: NDBuffer[type, 4, origin, output_shape], input: NDBuffer[type, 4, origin, input_shape], filter: NDBuffer[type, filter_rank, origin, filter_shape], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[2], dilation: IndexList[2], num_groups: Int) -> ConvShape[2]`

---

## get_current_trace_id

`get_current_trace_id[level: TraceLevel]() -> Int`

Returns the ID of the last trace entry created on the current thread.

**Parameters:**

* level (`TraceLevel`): The trace level to check.

**Returns:**

The ID of the current trace if profiling is enabled, otherwise 0.
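A minimal usage sketch (the import path and the `TraceLevel.OP` level are assumptions based on how tracing utilities are commonly exposed; adjust to your build):

```mojo
# Assumed import path; tracing utilities are typically in runtime.tracing.
from runtime.tracing import TraceLevel, get_current_trace_id

fn main():
    # With profiling disabled, this simply prints 0.
    print(get_current_trace_id[TraceLevel.OP]())
```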
---

## get_direct_conv_micro_kernel_height

`get_direct_conv_micro_kernel_height() -> Int`

---

## get_direct_conv_micro_kernel_width

`get_direct_conv_micro_kernel_width() -> Int`

---

## get_dispatch_table

`get_dispatch_table[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool]() -> Dict[String, MatmulConfig[a_type, b_type, c_type, transpose_b]]`

---

## get_fragment_size

`get_fragment_size[mma_shape: IndexList[3]]() -> IndexList[3]`

Calculates the fragment size per thread for a given MMA shape.

For tensor core operations, each thread in a warp handles a portion of the computation. This function determines how many elements each thread needs to process for the A, B, and C/D matrices based on the MMA shape.

**Parameters:**

* mma_shape (`IndexList[3]`): An `IndexList[3]` containing the MMA dimensions [M, N, K].

**Returns:**

An `IndexList[3]` containing the fragment sizes per thread for matrices A, B, and C/D respectively, calculated as: `[M*K/WARP_SIZE, N*K/WARP_SIZE, M*N/WARP_SIZE]`. For example, with `mma_shape = [16, 8, 16]` and a warp size of 32, the fragment sizes are `[8, 4, 4]`.

---

## get_identity_rope_coeff

`get_identity_rope_coeff[width: Int, type: DType]() -> SIMD[type, width]`

---

## get_kernel_config

`get_kernel_config[a_type: DType, b_type: DType, c_type: DType, *, kernel_type: Bool = False]() -> KernelConfig`

Utility function to extract matmul configuration parameters for exported functions.

TODO: Add target-dependent configuration parameters.

---

## get_kernel_type

`get_kernel_type(m: Int, n: Int, k: Int) -> Bool`

---

## get_linkage_name

`get_linkage_name[func_type: AnyTrivialRegType, //, target: target, func: func_type]() -> StringSlice[StaticConstantOrigin]`

Returns the symbol name of `func`.

**Parameters:**

* func_type (`AnyTrivialRegType`): Type of func.
* target (`target`): The compilation target.
* func (`func_type`): A Mojo function.

**Returns:**

Symbol name.

`get_linkage_name[func_type: AnyTrivialRegType, //, func: func_type]() -> StringSlice[StaticConstantOrigin]`

Returns the symbol name of `func`.

**Parameters:**

* func_type (`AnyTrivialRegType`): Type of func.
* func (`func_type`): A Mojo function.

**Returns:**

Symbol name.

---

## get_matmul_arch_factor

`get_matmul_arch_factor[use_vnni: Bool, use_i8mm: Bool]() -> Int`

---

## get_matmul_kernel_shape

`get_matmul_kernel_shape[a_type: DType, b_type: DType, c_type: DType, kernel_type: Bool]() -> MicroKernelShape`

---

## get_matmul_kernel_shape_ARM

`get_matmul_kernel_shape_ARM[a_type: DType, b_type: DType, c_type: DType, kernel_type: Bool]() -> MicroKernelShape`

---

## get_matmul_kernel_shape_x86

`get_matmul_kernel_shape_x86[kernel_type: Bool]() -> MicroKernelShape`

---

## get_matmul_num_tasks

`get_matmul_num_tasks[a_type: DType, b_type: DType, c_type: DType, simd_size: Int, kernel_type: Bool](m: Int, n: Int, k: Int, max_num_tasks: Int) -> Int`

Computes the number of tasks for parallel matmul. The maximum number of tasks is typically the number of threads/cores.

---

## get_matmul_prefetch_b_distance_k

`get_matmul_prefetch_b_distance_k() -> Int`

---

## get_mha_decoding_num_partitions

`get_mha_decoding_num_partitions[num_heads: Int, group: Int](batch_size: Int, num_keys: Int, ctx: DeviceContext) -> Int`

---

## get_micro_kernel_shape

`get_micro_kernel_shape[rank: Int, WO: Dim, F: Dim, conv_attr: ConvInfoStatic[rank], simd_size: Int]() -> IndexList[2]`

---

## get_min_task_size

`get_min_task_size() -> Int`

---

## get_mma_shape

`get_mma_shape[input_type: DType, accum_type: DType, shape_id: Int = 0]() -> IndexList[3]`

Returns the appropriate matrix multiply-accumulate (MMA) shape for tensor core operations.
Selects the optimal MMA shape based on the GPU architecture, input data type, accumulation data type, and optional shape identifier. This function handles different configurations for both NVIDIA and AMD GPUs.

**Parameters:**

* input_type (`DType`): The data type of the input matrices (A and B).
* accum_type (`DType`): The data type used for accumulation (C and D).
* shape_id (`Int`): Optional identifier to select between multiple valid shapes (default: 0).

**Returns:**

An `IndexList[3]` containing the MMA dimensions in the format `[M, N, K]`, where `M×N` is the output matrix size and `K` is the reduction dimension.

---

## get_num_partitions

`get_num_partitions[micro_kernel_height: Int, micro_kernel_f_size: Int](num_threads: Int, conv_shape: ConvShape[rank]) -> IndexList[4]`

Partitions the workload in (batch & group, C, F, H) dimensions. The actual number of tasks is the product of the returned `num_partitions`.

---

## get_pack_data_size

`get_pack_data_size[type: DType]() -> Int`

Utility to compute the number of elements to pack in each tile.

**Returns:**

The number of elements to pack.

---

## get_packB_unroll_factor

`get_packB_unroll_factor() -> Int`

---

## get_padding_output_shape

`get_padding_output_shape[rank: Int](input_shape: IndexList[rank], paddings: LayoutTensor[index, __init__[::Origin[::Bool(IntTuple((rank * 2))), origin]) -> IndexList[rank]`

---

## get_partition

`get_partition(task_id: Int, num_partitions: IndexList[4], conv_shape: ConvShape[rank], micro_kernel_height: Int, micro_kernel_f_size: Int) -> ConvPartition`

---

## get_partitioned_matmul

`get_partitioned_matmul[a_type: DType, b_type: DType, c_type: DType, kernel_rows: Int, kernel_cols: Int](m: Int, n: Int, k: Int, task_id: Int, num_tasks: Int) -> SubMatmulConfig`

---

## get_partitioned_matmul_mojo

`get_partitioned_matmul_mojo[b_type: DType, kernel_rows: Int, kernel_cols: Int, use_i8mm: Bool = False](m: Int, n: Int, k: Int, task_id: Int, num_tasks: Int) -> SubMatmulConfig`

---

## get_partitioned_matmul_mojo_shape

`get_partitioned_matmul_mojo_shape[b_type: DType, kernel_rows: Int, kernel_cols: Int, use_i8mm: Bool](m: Int, n: Int, k: Int, num_tasks: Int) -> IndexList[2]`

---

## get_safetensors_idx

`get_safetensors_idx(head_dim_idx: Int, head_size: Int) -> Tuple[Int, Int]`

---

## get_sliding_window_out_dim

`get_sliding_window_out_dim[ceil_mode: Bool = False](in_dim: Int, ft_dim: Int, dilation: Int, stride: Int, pad: Int) -> Int`

Returns the output dimension for a sliding window operation along some dimension.

**Parameters:**

* ceil_mode (`Bool`): Defines the rounding mode for the shape calculation.

**Args:**

* in_dim (`Int`): The size of the input dimension.
* ft_dim (`Int`): The size of the corresponding filter dimension.
* dilation (`Int`): The dilation for the sliding window operation.
* stride (`Int`): The stride for the sliding window operation.
* pad (`Int`): The total padding for the sliding window operation.

**Returns:**

The size of the output dimension.

---

## get_start_and_end_for_partitions

`get_start_and_end_for_partitions[tile_size: Int](num_keys: Int, num_partitions: Int, partition_idx: Int) -> Tuple[Int, Int]`

Calculates start and end indices for a partition.

**Args:**

* num_keys (`Int`): Total number of keys (sequence length).
* num_partitions (`Int`): Number of partitions to split keys into.
* partition_idx (`Int`): Index of the current partition (0 to num_partitions - 1).

**Returns:**

Tuple of (start_idx, end_idx) for the partition, aligned to tile_size.

---

## get_static_string

`get_static_string[string: StringSlice[StaticConstantOrigin], *extra: StringSlice[StaticConstantOrigin]]() -> StringSlice[StaticConstantOrigin]`

Forms a StaticString from compile-time StringSlice values. This guarantees that the returned string is a compile-time constant in static memory. It also guarantees that there is a 'nul' zero byte at the end, which is not included in the returned range.

**Parameters:**

* string (`StringSlice[StaticConstantOrigin]`): The first StringSlice value.
* *extra (`StringSlice[StaticConstantOrigin]`): Additional StringSlice values to concatenate.

**Returns:**

The string value as a StaticString.

---

## getenv

`getenv(owned name: String, default: String = __init__[__mlir_type.!kgen.string]("")) -> String`

Returns the value of the given environment variable.

**Constraints:**

The function only works on macOS or Linux, and returns an empty string otherwise.

**Args:**

* name (`String`): The name of the environment variable.
* default (`String`): The default value to return if the environment variable doesn't exist.

**Returns:**

The value of the environment variable.

---

## getpwnam

`getpwnam(owned name: String) -> Passwd`

Retrieves the user ID in the password database for the given user name.

**Constraints:**

This function is constrained to run on Linux or macOS operating systems only.

**Args:**

* name (`String`): The name of the user to retrieve the password entry for.

**Returns:**

An object containing the user's account information, including login name, encrypted password, user ID, group ID, real name, home directory, and shell program.

**Raises:**

If the user name does not exist or there is an error retrieving the information.

---

## getpwuid

`getpwuid(uid: Int) -> Passwd`

Retrieves the password database entry for a given user ID.

**Constraints:**

This function is constrained to run on Linux or macOS operating systems only.

**Args:**

* uid (`Int`): The user ID for which to retrieve the password database entry.

**Returns:**

An object containing the user's account information, including login name, encrypted password, user ID, group ID, real name, home directory, and shell program.

**Raises:**

If the user ID does not exist or there is an error retrieving the information.

---

## getsize

`getsize[PathLike: PathLike, //](path: PathLike) -> Int`

Returns the size, in bytes, of the specified path.

**Parameters:**

* PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* path (`PathLike`): The path to the file.

**Returns:**

The size of the path in bytes.

---

## gettempdir

`gettempdir() -> Optional[String]`

Returns the default directory to use for temporary files.

**Returns:**

The name of the default temporary directory.

---

## getuid

`getuid() -> Int`

Retrieves the user ID of the calling process.

**Constraints:**

This function is constrained to run on Linux or macOS operating systems only.

**Returns:**

The user ID of the calling process.
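A short sketch showing a couple of these functions in use (the environment variable name and file path below are placeholders, and the `os.path` location of `getsize` is an assumption):

```mojo
from os import getenv
# getsize is documented above; assumed to live in os.path.
from os.path import getsize

def main():
    # Returns the default value when the variable is unset.
    print(getenv("MY_APP_CONFIG", "config.toml"))
    # Size in bytes of a file; raises if the path doesn't exist.
    print(getsize("mojoproject.toml"))
```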
---

## gevm_kernel

`gevm_kernel[c_type: DType, a_type: DType, b_type: DType, *, tile_size: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: UnsafePointer[SIMD[c_type, 1]], a: UnsafePointer[SIMD[a_type, 1]], b: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)`

---

## gevm_tc_kernel_vector_8x

`gevm_tc_kernel_vector_8x[c_type: DType, a_type: DType, b_type: DType, tile_size: Int, simd_width: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c: NDBuffer[c_type, 2, MutableAnyOrigin], a: NDBuffer[a_type, 2, MutableAnyOrigin], b: NDBuffer[b_type, 2, MutableAnyOrigin], m: UInt, n: UInt, k: UInt)`

---

## globals

This module provides GPU-specific global constants and configuration values.

The module defines hardware-specific constants like warp size and thread block limits that are used throughout the GPU programming interface. It handles both NVIDIA and AMD GPU architectures, automatically detecting and configuring the appropriate values based on the available hardware.

The constants are resolved at compile time based on the target GPU architecture and are used to optimize code generation and ensure hardware compatibility.

## Aliases

### `MAX_THREADS_PER_BLOCK_METADATA`

`alias MAX_THREADS_PER_BLOCK_METADATA = _resolve_max_threads_per_block_metadata()`

This metadata tag is used in conjunction with `__llvm_metadata` to give the compiler a hint about the maximum threads per block that's used.

### `WARP_SIZE`

`alias WARP_SIZE = _resolve_warp_size()`

The number of threads that execute in lockstep within a warp on the GPU.

This constant represents the hardware warp size, which is the number of threads that execute instructions synchronously as a unit. The value is architecture-dependent:

* 32 threads per warp on NVIDIA GPUs
* 64 threads per warp on AMD GPUs
* 0 if no GPU is detected

The warp size is a fundamental parameter that affects:

* Thread scheduling and execution
* Memory access coalescing
* Synchronization primitives
* Overall performance optimization

---

## Glossary

## AI terms

## GPU terms

---

## gpu

Provides low-level programming constructs for working with GPUs. These low-level constructs allow you to write code that runs on the GPU in a traditional programming style: partitioning work across threads that are mapped onto 1-, 2-, or 3-dimensional blocks. The thread blocks can subsequently be grouped into a grid of thread blocks.

A *kernel* is a function that runs on the GPU in parallel across many threads. Currently, the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct provides the interface for compiling and launching GPU kernels inside MAX [custom operations](/max/custom-ops/).

The [`gpu.host`](/mojo/stdlib/gpu/host/) package includes APIs to manage interaction between the *host* (that is, the CPU) and *device* (that is, the GPU or accelerator).
See the [`gpu.id`](/mojo/stdlib/gpu/id#aliases) module for a list of aliases you can use to access information about the grid and the current thread, including block dimensions, block index in the grid, and thread index. The [`sync`](/mojo/stdlib/gpu/sync/) module provides functions for synchronizing threads.

For an example of launching a GPU kernel from a MAX custom operation, see the [vector addition example](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/vector_addition.mojo) in the MAX repo.

## Packages

* [`comm`](/mojo/stdlib/gpu/comm/): The `gpu.comm` package provides communication primitives for GPUs.
* [`host`](/mojo/stdlib/gpu/host/): Implements the gpu host package.

## Modules

* [`block`](/mojo/stdlib/gpu/block/): GPU block-level operations and utilities.
* [`cluster`](/mojo/stdlib/gpu/cluster/): This module provides low-level NVIDIA GPU cluster synchronization primitives for SM90+ architectures.
* [`globals`](/mojo/stdlib/gpu/globals/): This module provides GPU-specific global constants and configuration values.
* [`grid_controls`](/mojo/stdlib/gpu/grid_controls/): Grid Dependent Control primitives for NVIDIA Hopper (SM90+) GPUs.
* [`id`](/mojo/stdlib/gpu/id/): This module provides GPU thread and block indexing functionality.
* [`intrinsics`](/mojo/stdlib/gpu/intrinsics/): Provides low-level GPU intrinsic operations and memory access primitives.
* [`memory`](/mojo/stdlib/gpu/memory/): This module provides GPU memory operations and utilities.
* [`mma`](/mojo/stdlib/gpu/mma/): This module includes utilities for working with the warp-matrix-matrix-multiplication (wmma) instructions.
* [`mma_sm100`](/mojo/stdlib/gpu/mma_sm100/): This module includes utilities for working with the SM100 MMA instructions.
* [`mma_util`](/mojo/stdlib/gpu/mma_util/): Matrix multiply accumulate (MMA) utilities for GPU tensor cores.
* [`profiler`](/mojo/stdlib/gpu/profiler/): This module provides GPU profiling functionality.
* [`random`](/mojo/stdlib/gpu/random/): Random number generation for GPU kernels.
* [`semaphore`](/mojo/stdlib/gpu/semaphore/): This module provides a device-wide semaphore implementation for NVIDIA GPUs.
* [`sync`](/mojo/stdlib/gpu/sync/): This module provides GPU synchronization primitives and barriers.
* [`tcgen05`](/mojo/stdlib/gpu/tcgen05/): This module includes utilities for working with the tensorcore 5th generation (tcgen05) instructions.
* [`tensor_ops`](/mojo/stdlib/gpu/tensor_ops/): This module provides tensor core operations and utilities for GPU computation.
* [`warp`](/mojo/stdlib/gpu/warp/): GPU warp-level operations and utilities.

---

## GPU debugging

The MAX SDK provides support for debugging Mojo code running on GPU using [CUDA-GDB](https://docs.nvidia.com/cuda/cuda-gdb/index.html#). You can debug either using the `cuda-gdb` command-line interface, or through VS Code, using the Mojo and NVIDIA extensions.

:::note Limitations

Currently there are a couple of notable limitations to debugging Mojo code on GPU:

- GPU debugging is supported only on NVIDIA GPUs.
- You cannot debug Mojo code running inside a MAX [custom operation](/max/custom-ops/). (You can only debug Mojo GPU code launched from a Mojo program when using the [gpu.host](/mojo/stdlib/gpu/host/) API.)

:::

## GPU debugging setup

To debug Mojo code on GPU, you need to be able to run Mojo code on GPU. Currently this requires a Linux system with a supported GPU. For details, see [GPU requirements](/max/faq/#gpu-requirements).
If you're using VS Code, you can run it on the same system where your GPU code is running (the "target system"), or on a separate system using remote debugging. To debug on GPU, you need to choose to debug with CUDA-GDB when you start a debugging session. Note that CUDA-GDB has very limited debugging capabilities for Mojo code running on the CPU.

To set up for GPU debugging:

1. Install the [NVIDIA CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) 12.4 or later on the target system. Make sure that the `cuda-gdb` binary is in your `$PATH` environment variable. For example, if you have CUDA Toolkit 12.8 installed, add `/usr/local/cuda-12.8/bin` to your `$PATH`.

2. If using VS Code, install [Nsight Visual Studio Code Edition](https://marketplace.visualstudio.com/items?itemName=nvidia.nsight-vscode-edition) from the Visual Studio Marketplace.

### Using the classic debugger backend

CUDA-GDB includes two debugger backends, called the "classic debugger" and the "universal debugger." By default, Mojo uses the CUDA-GDB universal debugger. However, some systems require the classic debugger instead. If you find that the debugger is losing its connection with the process being debugged, you may need to use the classic debugger.

To use the classic debugger backend:

- On the target system, set the environment variable `CUDBG_USE_LEGACY_DEBUGGER` to `1` in your shell configuration file (for example, `.bashrc`, `.zshrc`, or `config.fish`). Source the file or start a new shell.

- When creating a launch configuration for GPU debugging, add the following settings to the `launch.json` configuration:

  ```json
  "legacyDebugger": true,
  "initCommands": [
      "set environment CUDBG_USE_LEGACY_DEBUGGER=1"
  ],
  ```

## Start GPU debugging from the command line

To start a GPU debugging session in VS Code from the command line, run the following command on the target system:

```bash
mojo debug --cuda-gdb --break-on-launch --vscode myproject.mojo
```

To use the CUDA-GDB command-line debugger, omit the `--vscode` argument.

The `--break-on-launch` flag is optional but very useful: it stops execution as soon as the GPU kernel launches, allowing you to set breakpoints inside the GPU code.

## Start GPU debugging from VS Code

The easiest way to start GPU debugging from VS Code is to add a [launch configuration](#launch-configurations). For example, the following launch configuration starts debugging the current Mojo file using CUDA-GDB.

```json
{
    "type": "mojo-cuda-gdb",
    "request": "launch",
    "name": "Mojo: Debug current Mojo file with CUDA-GDB",
    "description": "Launch and debug the Mojo file that is active on the editor when the debug session starts, using CUDA-GDB.",
    "mojoFile": "${file}",
    "args": [],
    "env": [],
    "cwd": "${workspaceFolder}",
    "breakOnLaunch": true,
    "legacyDebugger": true,
    "initCommands": [
        "set environment CUDBG_USE_LEGACY_DEBUGGER=1"
    ]
}
```

The last two settings, `legacyDebugger` and `initCommands`, should only be included if they're required on your system to maintain a stable connection to the process being debugged, as described in [Using the classic debugger backend](#using-the-classic-debugger-backend).

## Issuing CUDA-GDB commands

Some features of the debugger are only available via CUDA-GDB commands, so it's worth familiarizing yourself with CUDA-GDB even if you're using VS Code as a frontend. If you're running in VS Code, you can enter CUDA-GDB commands in the debug console by prefixing them with a single backtick (`` ` ``).
Note that the console may automatically add a second backtick at the end of your command, which prevents it from being recognized as a CUDA-GDB command. Be sure to remove the second backtick before submitting the command. The examples in the following sections are raw CUDA-GDB commands, without the backtick.

A good starting point for learning about CUDA-GDB is the [CUDA-GDB User Manual](https://docs.nvidia.com/cuda/cuda-gdb/index.html#).

## Tips and tricks

The following sections provide tips for some common tasks in GPU debugging.

### Setting breakpoints in GPU code

There are a few quirks to setting breakpoints in GPU code.

If you set a breakpoint in GPU code before the GPU kernel launches, the breakpoint will show up in a different function (a CPU function). When the debugger pauses at this first breakpoint, click **Continue** to resume execution, and the debugger should stop at the correct location in the GPU code.

When paused at the first breakpoint, you can add more breakpoints in the GPU code; however, the breakpoints won't show up in the left gutter until the GPU kernel launches.

Symbol breakpoints aren't supported when debugging with CUDA-GDB.

On the CUDA-GDB command line, you can set a breakpoint using the `break` command (which can be abbreviated to `b`):

```plaintext
break filename.mojo:line_number
b filename.mojo:line_number
```

### Stepping not supported on GPU

The step commands (**Step Over**, **Step Into**, and **Step Out**) do not work reliably on GPU. Instead, we recommend adding breakpoints and using **Continue** to move between breakpoints.

### Changing kernel focus

You can use CUDA-GDB commands to change the current _kernel focus_: that is, the block and thread index that you're currently inspecting. Use the `cuda` command to inspect the current focus or change focus:

```plaintext
cuda block thread
block (0,0,0), thread (0,0,0)
cuda block 0,0,0 thread 1,0,0
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
```

For more information, see [Kernel focus](https://docs.nvidia.com/cuda/cuda-gdb/index.html#kernel-focus) in the CUDA-GDB documentation.

### Inspecting registers

Inspect the values of registers using the `info registers` command:

```plaintext
info registers $R0 $R1
R0             0xebfffda8          -335544920
R1             0xfffda8            16776616
```

For more information, see [Registers](https://docs.nvidia.com/cuda/cuda-gdb/index.html#registers) in the CUDA-GDB documentation.

---

## GPU glossary

---

## GPU memory

GPU memory consists of both on-chip memory and external dynamic random-access memory (DRAM), often referred to as *device memory* (in contrast to the *host memory* used by the CPU).
On-chip memory includes:

- A register file for each [streaming multiprocessor](streaming-multiprocessor.mdx) (SM), containing the [registers](register.mdx) used by threads executing on the SM's cores
- An L1 cache for each SM to cache reads from global memory
- Shared memory for each SM, containing data explicitly shared between the threads of a given [thread block](thread-block.mdx) executing on the SM
- A read-only constant cache for each SM, which caches data read from the constant memory space in global memory
- An L2 cache shared by all SMs that is used to cache accesses to local or global memory, including temporary register spills

Device memory includes:

- Global memory, which contains data accessible to all threads
- Constant memory, which contains data explicitly identified as read-only by the programmer, and which is accessible to all threads
- Local memory, which contains data private to an individual thread, such as statically allocated arrays, spilled registers, and other elements of the thread's call stack

Data in global memory persists until explicitly freed, even across [kernel](kernel.mdx) functions. This means that one kernel can write data to global memory and a subsequent kernel can then read that data.

---

## gpu_qint4_repack_GPTQ

`gpu_qint4_repack_GPTQ[b_shape: DimList, b_packed_shape: DimList, //, group_size: Int, target: StringSlice[StaticConstantOrigin]](b: NDBuffer[uint8, 2, origin, b_shape], b_packed: NDBuffer[uint8, 2, origin, b_packed_shape], perm_idx: OptionalReg[NDBuffer[int32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[int32, 1, MutableAnyOrigin]]({:i1 0, 1}), ctx: DeviceContextPtr = DeviceContextPtr())`

---

## gpu_qint4_repack_Q4_0

`gpu_qint4_repack_Q4_0[b_shape: DimList, //, target: StringSlice[StaticConstantOrigin]](b: NDBuffer[uint8, 2, origin, b_shape], b_packed: NDBuffer[uint8, 2, origin, b_shape], ctx: DeviceContextPtr = DeviceContextPtr())`

---

## graph

APIs to build inference graphs for MAX Engine with Python.

## Classes

* [`BufferValue`](/max/api/python/graph/BufferValue): Represents a mutable semantic tensor within a Graph.
* [`Graph`](/max/api/python/graph/Graph): Represents a graph for MAX Engine.
* [`KernelLibrary`](/max/api/python/graph/KernelLibrary): Represents a library with custom ops.
* [`TensorValue`](/max/api/python/graph/TensorValue): Represents a value semantic tensor within a Graph.
* [`Value`](/max/api/python/graph/Value): Represents a symbolic value within a Graph.
* [`Weight`](/max/api/python/graph/Weight): Represents a weight value in a graph.

## Modules

* [`ops`](/max/api/python/graph/ops): Ops you can add when staging a graph.
* [`quantization`](/max/api/python/graph/quantization): APIs to quantize graph tensors.
* [`type`](/max/api/python/graph/type): APIs for graph value types.

---

## Graph

## `Graph` {#max.graph.Graph}

> *class* max.graph.Graph(name, forward=None, input\_types=(), path=None, \*args, custom\_extensions=\[], context=None, kernel\_library=None, module=None, \*\*kwargs)

Represents a single MAX graph.

A Graph is a callable routine in MAX Engine. Like a function, a graph has a name and signature. Unlike a function, which follows an imperative programming model, a Graph follows a dataflow programming model, using lazily-executed, parallel operations instead of sequential instructions.

When you instantiate a graph, you must specify the input shapes as one or more `TensorType` values. Then, build a sequence of ops and set the graph output with [`output()`](#max.graph.Graph.output).
For example:

```python
from dataclasses import dataclass

import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, TensorValue, ops

@dataclass
class Linear:
    weight: np.ndarray
    bias: np.ndarray

    def __call__(self, x: TensorValue) -> TensorValue:
        weight_tensor = ops.constant(self.weight, dtype=DType.float32, device=DeviceRef.CPU())
        bias_tensor = ops.constant(self.bias, dtype=DType.float32, device=DeviceRef.CPU())
        return ops.matmul(x, weight_tensor) + bias_tensor

linear_graph = Graph(
    "linear",
    Linear(np.ones((2, 2)), np.ones((2,))),
    input_types=[TensorType(DType.float32, (2,))]
)
```

You can't call a Graph directly from Python. You must compile it and execute it with MAX Engine. For more detail, see the tutorial about how to [build a graph with MAX Graph](/max/tutorials/get-started-with-max-graph-in-python).

When creating a graph, a global sequence of chains is initialized and stored in `Graph._current_chain`. Every side-effecting op (e.g. `buffer_load`, `store_buffer`, `load_slice_buffer`, `store_slice_buffer`) uses the current chain to perform the op and updates `Graph._current_chain` with a new chain. Currently, the input/output chains for mutable ops can be used at most once. The goal of this design choice is to prevent data races.

**Parameters:**

* **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) )
* **forward** (`Optional` `[` `Callable` `]` )
* **input\_types** (`Iterable` `[` [`Type`](type.md#max.graph.type.Type) `]` )
* **path** (`Optional` `[` `Path` `]` )
* **custom\_extensions** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `Path` `]` )
* **context** (`Optional` `[` `mlir.Context` `]` )
* **kernel\_library** (`Optional` `[` [`KernelLibrary`](KernelLibrary.md#max.graph.KernelLibrary) `]` )
* **module** (`Optional` `[` `mlir.Module` `]` )

### `add_subgraph()` {#max.graph.Graph.add_subgraph}

> add\_subgraph(name, forward=None, input\_types=(), path=None, custom\_extensions=\[])

**Parameters:**

* **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) )
* **forward** ([`Callable`](https://docs.python.org/3/library/typing.html#typing.Callable) `|` `None` )
* **input\_types** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`Type`](type.md#max.graph.type.Type) `]` )
* **path** ([`Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) `|` `None` )
* **custom\_extensions** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) `]` )

**Return type:**

[*Graph*](#max.graph.Graph)

### `add_weight()` {#max.graph.Graph.add_weight}

> add\_weight(weight, force\_initial\_weight\_on\_host=True)

Adds a weight to the graph. If the weight is already in the graph, returns the existing value.

**Parameters:**

* **weight** ([`Weight`](Weight.md#max.graph.Weight) ) – The weight to add to the graph.
* **force\_initial\_weight\_on\_host** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – If true, forces weights to initially be allocated on the host before being moved to the indicated device. This is needed as a stopgap until we have a more fleshed-out ownership model for external constants.

**Returns:**

A [`TensorValue`](TensorValue.md#max.graph.TensorValue) that contains this weight.

**Raises:**

[**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If a weight with the same name already exists in the graph.
**Return type:**

[*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `current` {#max.graph.Graph.current}

> current

### `inputs` {#max.graph.Graph.inputs}

> *property* inputs*: [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Value](Value.md#max.graph.Value)]*

The input values of the graph.

### `kernel_libraries_paths` {#max.graph.Graph.kernel_libraries_paths}

> *property* kernel\_libraries\_paths*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Path](https://docs.python.org/3/library/pathlib.html#pathlib.Path)]*

Returns the list of extra kernel library paths for the custom ops.

### `local_weights_and_chain()` {#max.graph.Graph.local_weights_and_chain}

> local\_weights\_and\_chain()

### `output()` {#max.graph.Graph.output}

> output(\*outputs)

Sets the output nodes of the [`Graph`](#max.graph.Graph).

**Parameters:**

**outputs** ([`Value`](Value.md#max.graph.Value) )

**Return type:**

None

### `output_types` {#max.graph.Graph.output_types}

> *property* output\_types*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Type](type.md#max.graph.type.Type)]*

View of the types of the graph output terminator.

---

## GreaterThanComparable

A type which can be greater-than compared with other instances of itself.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__gt__`

`__gt__(self: _Self, rhs: _Self) -> Bool`

Define whether `self` is greater than `rhs`.

**Args:**

* rhs (`_Self`): The right-hand side of the comparison.

**Returns:**

True if `self` is greater than `rhs`.

---

## GreaterThanOrEqualComparable

A type which can be greater-than-or-equal compared with other instances of itself.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__ge__`

`__ge__(self: _Self, rhs: _Self) -> Bool`

Define whether `self` is greater than or equal to `rhs`.

**Args:**

* rhs (`_Self`): The right-hand side of the comparison.

**Returns:**

True if `self` is greater than or equal to `rhs`.

---

## Grid

A grid is the top-level organizational structure of the threads executing a [kernel](kernel.mdx) function on a GPU. A grid consists of multiple [thread blocks](thread-block.mdx), which are further divided into individual [threads](thread.mdx) that execute the kernel function concurrently.

The division of a grid into thread blocks serves multiple crucial purposes:

- First, it breaks down the overall workload — managed by the grid — into smaller, more manageable portions that can be processed independently. This division allows for better resource utilization and scheduling flexibility across multiple [streaming multiprocessors](streaming-multiprocessor.mdx) (SMs) in the GPU.
- Second, thread blocks provide a scope for threads to collaborate through shared memory and synchronization primitives, enabling efficient parallel algorithms and data sharing patterns.
- Finally, thread blocks help with scalability by allowing the same program to run efficiently across different GPU architectures, as the hardware can automatically distribute blocks based on available resources.

The programmer specifies the number of thread blocks in a grid and how they are arranged across one, two, or three dimensions. Typically, the programmer determines the dimensions of the grid based on the dimensionality of the data to process.
For example, a programmer might choose a 1-dimensional grid for processing large vectors, a 2-dimensional grid for processing matrices, and a 3-dimensional grid for processing the frames of a video.

Each block within the grid is assigned a unique [block index](block-index.mdx) that determines its position within the grid. Similarly, the programmer also specifies the number of threads per thread block and how they are arranged across one, two, or three dimensions. Each thread within a block is assigned a unique [thread index](thread-index.mdx) that determines its position within the block. The combination of block index and thread index uniquely identifies the position of a thread within the overall grid.

---

## grid_controls

Grid Dependent Control primitives for NVIDIA Hopper (SM90+) GPUs.

This module provides low-level primitives for managing grid dependencies on NVIDIA Hopper architecture and newer GPUs. It enables efficient orchestration of multi-grid workloads by allowing grids to launch dependent grids and synchronize with them.

The module includes functions that map directly to CUDA grid dependency control instructions, providing fine-grained control over grid execution order:

* `launch_dependent_grids()`: Triggers execution of grids that depend on the current grid
* `wait_on_dependent_grids()`: Blocks until all dependent grids complete execution

These primitives are essential for implementing complex GPU execution pipelines where multiple kernels need to execute in a specific order with minimal overhead. They eliminate the need for host-side synchronization when orchestrating dependent GPU work.

## Structs

* [`PDL`](/mojo/stdlib/gpu/grid_controls/PDL): Programmatic Dependency Launch (PDL) control structure.
* [`PDLLevel`](/mojo/stdlib/gpu/grid_controls/PDLLevel): Programmatic Dependency Launch (PDL) level.

## Functions

* [`launch_dependent_grids`](/mojo/stdlib/gpu/grid_controls/launch_dependent_grids): Launches dependent grids that were previously configured to depend on the current grid.
* [`wait_on_dependent_grids`](/mojo/stdlib/gpu/grid_controls/wait_on_dependent_grids): Waits for all dependent grids launched by this grid to complete execution.

---

## group_norm

Group Normalization implementation using the graph API.

## `GroupNorm` {#max.nn.norm.group_norm.GroupNorm}

> *class* max.nn.norm.group\_norm.GroupNorm(num\_groups, num\_channels, eps=1e-05, affine=True, device=cpu:0)

Group normalization block.

Divides channels into groups and computes normalization stats per group. Follows the implementation pattern from PyTorch's group\_norm.
**Parameters:** * **num\_groups** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of groups to separate the channels into * **num\_channels** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of input channels * **eps** ([`float`](https://docs.python.org/3/library/functions.html#float) ) – Small constant added to denominator for numerical stability * **affine** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – If True, apply learnable affine transform parameters * **device** (`DeviceRef` ) --- ## grouped_matmul `grouped_matmul[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], max_num_tokens_per_expert: Int, num_active_experts: Int, ctx: DeviceContext)` --- ## grouped_matmul ## Aliases ### `NumWarpPerWarpGroup` `alias NumWarpPerWarpGroup = 4` ### `WARP_GROUP_SIZE` `alias WARP_GROUP_SIZE = 128` ## Functions * [​`default_config_sm90`](./default_config_sm90): * [​`grouped_matmul`](./grouped_matmul): * [​`grouped_matmul_kernel`](./grouped_matmul_kernel): * [​`grouped_matmul_sm90`](./grouped_matmul_sm90): * [​`naive_grouped_matmul`](./naive_grouped_matmul): * [​`naive_grouped_matmul_kernel`](./naive_grouped_matmul_kernel): --- ## grouped_matmul_kernel `grouped_matmul_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_smem_layout, c_desc_layout], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin])` --- ## grouped_matmul_sm90 `grouped_matmul_sm90[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool = True, wgmma_shape: IndexList[3] = Index(64, 256, 16), config: MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape] = default_config_sm90[::DType,::DType,::DType,::Bool,::IndexList[::Int(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], 
max_num_tokens_per_expert: Int, b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], num_active_experts: Int, ctx: DeviceContext)` --- ## Handle `struct Handle[backend: Backend = _resolve_backend[linalg::vendor_blas::Backend,::DType]()]` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `resolved_backend` `alias resolved_backend = _resolve_backend[linalg::vendor_blas::Backend,::DType]()` ### `type` `alias type = Variant[UnsafePointer[NoneType], UnsafePointer[NoneType], Handle, UnsafePointer[NoneType]]` ## Methods ### `__init__` `__init__(out self)` ### `__is__` `__is__(self, other: Backend) -> Bool` ### `__isnot__` `__isnot__(self, other: Backend) -> Bool` ### `__enter__` `__enter__(self) -> Self` ### `__exit__` `__exit__(mut self)` --- ## has_accelerator `has_accelerator() -> Bool` Returns True if the host system has an accelerator and False otherwise. **Returns:** True if the host system has an accelerator. --- ## has_amd_gpu_accelerator `has_amd_gpu_accelerator() -> Bool` Returns True if the host system has an AMD GPU and False otherwise. **Returns:** True if the host system has an AMD GPU. --- ## has_avx `has_avx() -> Bool` Returns True if the host system has AVX, otherwise returns False. **Returns:** True if the host system has AVX, otherwise returns False. --- ## has_avx2 `has_avx2() -> Bool` Returns True if the host system has AVX2, otherwise returns False. **Returns:** True if the host system has AVX2, otherwise returns False. --- ## has_avx512f `has_avx512f() -> Bool` Returns True if the host system has AVX512, otherwise returns False. **Returns:** True if the host system has AVX512, otherwise returns False. --- ## has_fma `has_fma() -> Bool` Returns True if the host system has FMA (Fused Multiply-Add) support, otherwise returns False. **Returns:** True if the host system has FMA support, otherwise returns False. --- ## has_intel_amx `has_intel_amx() -> Bool` Returns True if the host system has Intel AMX support, otherwise returns False. **Returns:** True if the host system has Intel AMX and False otherwise. --- ## has_neon `has_neon() -> Bool` Returns True if the host system has Neon support, otherwise returns False. **Returns:** True if the host system supports the Neon instruction set. --- ## has_neon_int8_dotprod `has_neon_int8_dotprod() -> Bool` Returns True if the host system has the Neon int8 dot product extension, otherwise returns False. **Returns:** True if the host system supports the Neon int8 dot product extension and False otherwise. --- ## has_neon_int8_matmul `has_neon_int8_matmul() -> Bool` Returns True if the host system has the Neon int8 matrix multiplication extension (I8MM), otherwise returns False. **Returns:** True if the host system supports the Neon int8 matrix multiplication extension (I8MM) and False otherwise. --- ## has_nvidia_gpu_accelerator `has_nvidia_gpu_accelerator() -> Bool` Returns True if the host system has an NVIDIA GPU and False otherwise. **Returns:** True if the host system has an NVIDIA GPU. --- ## has_sse4 `has_sse4() -> Bool` Returns True if the host system has sse4, otherwise returns False. **Deprecated:** Use `CompilationTarget.has_sse4()` instead. **Returns:** True if the host system has sse4, otherwise returns False. --- ## has_vnni `has_vnni() -> Bool` Returns True if the host system has avx512\_vnni, otherwise returns False. **Returns:** True if the host system has avx512\_vnni, otherwise returns False.
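Taken together, these queries let code branch on host capabilities. The following is a minimal, illustrative sketch, assuming these functions are importable from the `sys` package (the deprecation note above already points at `CompilationTarget` as the newer home for some of them):

```mojo
from sys import has_accelerator, has_avx2, has_neon

fn main():
    # Each query returns a Bool describing the host system.
    if has_accelerator():
        print("accelerator (e.g. GPU) available")
    if has_avx2():
        print("x86 host with AVX2")
    elif has_neon():
        print("Arm host with Neon")
    else:
        print("no AVX2 or Neon support detected")
```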
--- ## hash `hash[T: Hashable](hashable: T) -> UInt` Hash a Hashable type using its underlying hash implementation. **Parameters:** * ​T (`Hashable`): Any Hashable type. **Args:** * ​hashable (`T`): The input data to hash. **Returns:** A 64-bit integer hash based on the underlying implementation. `hash(bytes: UnsafePointer[SIMD[uint8, 1], alignment=alignment, mut=False, origin=origin], n: Int) -> UInt` Hash a byte array using a SIMD-modified DJBX33A hash algorithm. *This hash function is not suitable for cryptographic purposes.* The algorithm is easy to reverse, and deliberate hash collisions are easy to produce. The hash function is designed to have relatively good mixing and statistical properties for use in hash-based data structures. We *do*, however, initialize a random hash secret which is mixed into the final hash output. This can help prevent DDoS attacks on applications which make use of this function for dictionary hashing. As a consequence, hash values are deterministic within an individual runtime instance, i.e. a value will always hash to the same result, but between runs this value will change based on the hash secret. We take advantage of Mojo's first-class SIMD support to create a SIMD-vectorized hash function, using a simple hash algorithm as a base: * Interpret the input bytes as a SIMD vector, padded with zeros to align to the system SIMD width. * Apply the simple hash function parallelized across SIMD vectors. * Hash the final SIMD vector state to reduce to a single value. Python uses DJBX33A with a hash secret for smaller strings, and then the SipHash algorithm for longer strings. The arguments and tradeoffs are well documented in PEP 456. We should consider this and deeper performance/security tradeoffs as Mojo evolves. References: * [Wikipedia: Non-cryptographic hash function](https://en.wikipedia.org/wiki/Non-cryptographic_hash_function) * [Python PEP 456](https://peps.python.org/pep-0456/) * [PHP Hash algorithm and collisions](https://www.phpinternalsbook.com/php5/hashtables/hash_algorithm.html)

```mojo
from memory import UnsafePointer
from random import rand

fn main():
    var n = 64
    var rand_bytes = UnsafePointer[UInt8].alloc(n)
    rand(rand_bytes, n)
    print(hash(rand_bytes, n))
    rand_bytes.free()
```

**Args:** * ​bytes (`UnsafePointer[SIMD[uint8, 1], alignment=alignment, mut=False, origin=origin]`): The byte array to hash. * ​n (`Int`): The length of the byte array. **Returns:** A 64-bit integer hash. This hash is *not* suitable for cryptographic purposes, but will have good low-bit hash collision statistical properties for common data structures. --- ## hash Implements the `Hashable` trait and `hash()` built-in function. There are a few main tools in this module: * `Hashable` trait for types implementing `__hash__(self) -> UInt` * `hash[T: Hashable](hashable: T) -> UInt` built-in function. * A `hash()` implementation for arbitrary byte strings, `hash(data: UnsafePointer[UInt8], n: Int) -> UInt`, is the workhorse function, which implements efficient hashing via SIMD vectors. See the documentation of this function for more details on the hash implementation. * `hash(SIMD)` and `hash(UInt8)` implementations These are useful specializations of the general bytes implementation. ## Traits * [​`Hashable`](/mojo/stdlib/hashlib/hash/Hashable): A trait for types which specify a function to hash their data. ## Functions * [​`hash`](/mojo/stdlib/hashlib/hash/hash): Hash a Hashable type using its underlying hash implementation. --- ## Hashable A trait for types which specify a function to hash their data.
This hash function will be used for applications like hash maps, and doesn't need to be cryptographically secure. A good hash function will hash similar or common values to different results, and in particular the *low order bits* of the hash, which are used in smaller dictionaries, should be sensitive to any changes in the data structure. If your type's hash function doesn't meet these criteria, it will get poor performance in common hash map implementations.

```mojo
@value
struct Foo(Hashable):
    fn __hash__(self) -> UInt:
        return 4  # chosen by fair random dice roll

fn main():
    var foo = Foo()
    print(hash(foo))
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__hash__` `__hash__(self: _Self) -> UInt` Return a 64-bit hash of the type's data. **Returns:** A 64-bit integer hash of this instance's data. --- ## hashlib Implements the hashlib package that provides various hash algorithms. ## Modules * [​`hash`](/mojo/stdlib/hashlib/hash/): Implements the `Hashable` trait and `hash()` built-in function. --- ## hex `hex(value: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given integer. The hexadecimal representation is a base-16 encoding of the integer value. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Args:** * ​value (`SIMD[dtype, 1]`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given integer. `hex[T: Intable, //](value: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given integer. The hexadecimal representation is a base-16 encoding of the integer value. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Parameters:** * ​T (`Intable`): The `Intable` type to represent in hexadecimal. **Args:** * ​value (`T`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given integer. `hex(value: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0x")) -> String` Returns the hex string representation of the given scalar bool. The hexadecimal representation is a base-16 encoding of the bool. The returned string will be prefixed with "0x" to indicate that the subsequent digits are hex. **Args:** * ​value (`SIMD[bool, 1]`): The bool value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the hex representation of the given bool.
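As a quick illustration of the `hex()` overloads above, here is a minimal sketch (`hex()` is a built-in, so no import should be needed; the expected outputs in the comments follow from the signatures above and are illustrative):

```mojo
fn main():
    print(hex(255))             # Intable overload: 0xff
    print(hex(True))            # scalar Bool overload: 0x1
    print(hex(255, prefix=""))  # override the default "0x" prefix: ff
```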
--- ## hf ## `ContinuousHFStaticCache` {#max.nn.kv_cache.hf.ContinuousHFStaticCache} > *class* max.nn.kv\_cache.hf.ContinuousHFStaticCache(config, max\_batch\_size, max\_seq\_len, device, dtype=torch.float32, layer\_device\_map=None) **Parameters:** * **config** (`PretrainedConfig` ) * **max\_batch\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **device** (`device` ) * **dtype** (`dtype` ) * **layer\_device\_map** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `device` `|` [`int`](https://docs.python.org/3/library/functions.html#int) `]` `|` `None` ) ### `external_claim()` {#max.nn.kv_cache.hf.ContinuousHFStaticCache.external_claim} > external\_claim(seq\_ids) **Parameters:** **seq\_ids** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) **Return type:** None ### `get_attention_mask()` {#max.nn.kv_cache.hf.ContinuousHFStaticCache.get_attention_mask} > get\_attention\_mask(seq\_ids) **Parameters:** **seq\_ids** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) **Return type:** *Tensor* ### `release()` {#max.nn.kv_cache.hf.ContinuousHFStaticCache.release} > release(seq\_id) **Parameters:** **seq\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** None ### `reset()` {#max.nn.kv_cache.hf.ContinuousHFStaticCache.reset} > reset() Resets the cache values while preserving the objects. **Return type:** None ### `set_active_slots()` {#max.nn.kv_cache.hf.ContinuousHFStaticCache.set_active_slots} > set\_active\_slots(seq\_ids) **Parameters:** **seq\_ids** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) **Return type:** None ### `set_cache_position()` {#max.nn.kv_cache.hf.ContinuousHFStaticCache.set_cache_position} > set\_cache\_position(cache\_position) **Parameters:** **cache\_position** (`Tensor` ) ### `update()` {#max.nn.kv_cache.hf.ContinuousHFStaticCache.update} > update(key\_states, value\_states, layer\_idx, cache\_kwargs=None) Updates the cache with the new key\_states and value\_states for layer `layer_idx`. It is VERY important to index using a tensor, otherwise you introduce a copy to the device. **Parameters:** * **key\_states** (torch.Tensor) – The new key states to cache. * **value\_states** (torch.Tensor) – The new value states to cache. * **layer\_idx** (int) – The index of the layer to cache the states for. * **cache\_kwargs** (Dict\[str, Any], optional) – Additional arguments for the cache subclass. The StaticCache needs the cache\_position input to know where to write in the cache. **Returns:** A tuple containing the updated key and value states.
**Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[*Tensor*, *Tensor*] ### `update_attention_pattern()` {#max.nn.kv_cache.hf.ContinuousHFStaticCache.update_attention_pattern} > update\_attention\_pattern(seq\_id, attention\_mask) **Parameters:** * **seq\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **attention\_mask** (`Tensor` ) **Return type:** None --- ## hf_pipeline Generalized Token Generation Pipeline ## `HFEmbeddingsPipeline` {#max.pipelines.lib.hf_pipeline.HFEmbeddingsPipeline} > *class* max.pipelines.lib.hf\_pipeline.HFEmbeddingsPipeline(pipeline\_config, torch\_device\_type) Generalized embeddings pipeline. **Parameters:** * **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) ) * **torch\_device\_type** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) ### `encode()` {#max.pipelines.lib.hf_pipeline.HFEmbeddingsPipeline.encode} > encode(batch) Encodes a batch of text inputs. **Parameters:** **batch** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`TextContext`](core.md#max.pipelines.core.TextContext) `]` ) **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [*EmbeddingsResponse*](core.md#max.pipelines.core.EmbeddingsResponse)] ### `prepare_initial_token_inputs()` {#max.pipelines.lib.hf_pipeline.HFEmbeddingsPipeline.prepare_initial_token_inputs} > prepare\_initial\_token\_inputs(context\_batch) **Parameters:** **context\_batch** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`TextContext`](core.md#max.pipelines.core.TextContext) `]` ) **Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[*Tensor*, *Tensor*] ## `HFTextGenerationPipeline` {#max.pipelines.lib.hf_pipeline.HFTextGenerationPipeline} > *class* max.pipelines.lib.hf\_pipeline.HFTextGenerationPipeline(pipeline\_config, torch\_device\_type) HuggingFace text token generator pipeline. **Parameters:** * **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) ) * **torch\_device\_type** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) ### `next_token()` {#max.pipelines.lib.hf_pipeline.HFTextGenerationPipeline.next_token} > next\_token(batch, num\_steps) Given a batch, processes the batch inputs, executes the graph for num\_steps in a multi-step scenario, then decodes the tokens holistically and returns the list of decoded tokens. **Parameters:** * **batch** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`TextContext`](core.md#max.pipelines.core.TextContext) `]` ) * **num\_steps** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [*TextGenerationResponse*](core.md#max.pipelines.core.TextGenerationResponse)] ### `release()` {#max.pipelines.lib.hf_pipeline.HFTextGenerationPipeline.release} > release(context) Releases resources associated with this context. **Parameters:** **context** (`TokenGeneratorContext` ) – Finished context. **Return type:** None --- ## hf_utils Utilities for interacting with HuggingFace Files/Repos.
## `HuggingFaceFile` {#max.pipelines.lib.hf_utils.HuggingFaceFile} > *class* max.pipelines.lib.hf\_utils.HuggingFaceFile(repo\_id, filename, revision=None) A simple object for tracking Hugging Face model metadata. The repo\_id will frequently be used to load a tokenizer, whereas the filename is used to download model weights. **Parameters:** * **repo\_id** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **filename** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **revision** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) ### `download()` {#max.pipelines.lib.hf_utils.HuggingFaceFile.download} > download(force\_download=False) Download the file and return the file path where the data is saved locally. **Parameters:** **force\_download** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** [*Path*](https://docs.python.org/3/library/pathlib.html#pathlib.Path) ### `exists()` {#max.pipelines.lib.hf_utils.HuggingFaceFile.exists} > exists() **Return type:** [bool](https://docs.python.org/3/library/functions.html#bool) ### `filename` {#max.pipelines.lib.hf_utils.HuggingFaceFile.filename} > filename\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* ### `repo_id` {#max.pipelines.lib.hf_utils.HuggingFaceFile.repo_id} > repo\_id\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* ### `revision` {#max.pipelines.lib.hf_utils.HuggingFaceFile.revision} > revision\*: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* ### `size()` {#max.pipelines.lib.hf_utils.HuggingFaceFile.size} > size() **Return type:** [int](https://docs.python.org/3/library/functions.html#int) | None ## `HuggingFaceRepo` {#max.pipelines.lib.hf_utils.HuggingFaceRepo} > *class* max.pipelines.lib.hf\_utils.HuggingFaceRepo(repo\_id, revision='main', trust\_remote\_code=False, repo\_type=None) A class for interacting with HuggingFace Repos. 
**Parameters:** * **repo\_id** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **revision** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **trust\_remote\_code** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **repo\_type** (`RepoType` `|` `None` ) ### `download()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.download} > download(filename, force\_download=False) **Parameters:** * **filename** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **force\_download** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** [*Path*](https://docs.python.org/3/library/pathlib.html#pathlib.Path) ### `encoding_for_file()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.encoding_for_file} > encoding\_for\_file(file) **Parameters:** **file** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) ) **Return type:** *SupportedEncoding* ### `file_exists()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.file_exists} > file\_exists(filename) **Parameters:** **filename** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) **Return type:** [bool](https://docs.python.org/3/library/functions.html#bool) ### `files_for_encoding()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.files_for_encoding} > files\_for\_encoding(encoding, weights\_format=None) **Parameters:** * **encoding** (`SupportedEncoding` ) * **weights\_format** (`WeightsFormat` `|` `None` ) **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[*WeightsFormat*, [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Path*](https://docs.python.org/3/library/pathlib.html#pathlib.Path)]] ### `formats_available` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.formats_available} > *property* formats\_available\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[WeightsFormat]\* ### `info` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.info} > *property* info\*: ModelInfo\* ### `repo_id` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.repo_id} > repo\_id\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* The HuggingFace repo ID. Although it’s called repo\_id, it can be either a HuggingFace remote repo ID or a local path. ### `repo_type` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.repo_type} > repo\_type\*: RepoType | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The type of repo. This is inferred from the repo\_id. ### `revision` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.revision} > revision\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* *= 'main'* The revision to use for the repo. ### `size_of()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.size_of} > size\_of(filename) **Parameters:** **filename** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) **Return type:** [int](https://docs.python.org/3/library/functions.html#int) | None ### `supported_encodings` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.supported_encodings} > *property* supported\_encodings\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[SupportedEncoding]\* ### `trust_remote_code` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.trust_remote_code} > trust\_remote\_code\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* Whether to trust remote code.
### `weight_files` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.weight_files} > *property* weight\_files\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[WeightsFormat, [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]]\* ## `download_weight_files()` {#max.pipelines.lib.hf_utils.download_weight_files} > max.pipelines.lib.hf\_utils.download\_weight\_files(huggingface\_model\_id, filenames, revision=None, force\_download=False, max\_workers=8) Given a HuggingFace model ID and a list of filenames, downloads the weight files and returns the list of local paths. **Parameters:** * **huggingface\_model\_id** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – The HuggingFace model identifier, e.g. modularai/Llama-3.1-8B-Instruct-GGUF * **filenames** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `]` ) – A list of file paths relative to the root of the HuggingFace repo. If the provided files are available locally, the download is skipped and the local files are used. * **revision** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) – The HuggingFace revision to use. If provided, the local cache is checked directly, saving a network call to HuggingFace. * **force\_download** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether to force the files to be re-downloaded, even if they are already available in the local cache or at a provided path. * **max\_workers** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of worker threads to concurrently download files. **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Path*](https://docs.python.org/3/library/pathlib.html#pathlib.Path)] ## `generate_local_model_path()` {#max.pipelines.lib.hf_utils.generate_local_model_path} > max.pipelines.lib.hf\_utils.generate\_local\_model\_path(repo\_id, revision) Generate the local filesystem path where a HuggingFace model repo is cached. This function takes a HuggingFace repository ID and revision hash and returns the full local filesystem path where the model files are cached by the huggingface\_hub library. The path follows the standard HuggingFace caching convention of: \~/.cache/huggingface/hub/models--{org}--{model}/snapshots/{revision} **Parameters:** * **repo\_id** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – The HuggingFace repository ID in the format “org/model” (e.g. “HuggingFaceTB/SmolLM2-135M”) * **revision** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – The specific model revision hash to use, typically from a repo lock file **Returns:** The absolute path to the cached model files for the specified revision. For example: “\~/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M/snapshots/abc123” **Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str) **Raises:** [**FileNotFoundError**](https://docs.python.org/3/library/exceptions.html#FileNotFoundError) – If the model path does not exist locally ## `repo_exists_with_retry()` {#max.pipelines.lib.hf_utils.repo_exists_with_retry} > max.pipelines.lib.hf\_utils.repo\_exists\_with\_retry(repo\_id, revision) Wrapper around huggingface\_hub.revision\_exists with retry logic. Uses exponential backoff with 25% jitter, starting at 1s and doubling each retry.
We use revision\_exists here instead of repo\_exists because repo\_exists does not take in a revision parameter. See huggingface\_hub.revision\_exists for details. **Parameters:** * **repo\_id** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **revision** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) **Return type:** [bool](https://docs.python.org/3/library/functions.html#bool) --- ## hierarchical_unzip `hierarchical_unzip(layout_a: Layout, tiler: List[Layout]) -> Layout` Hierarchically unzips a layout according to a list of layouts. This function creates a hierarchical layout by unzipping the first layout according to the layouts in the tiler list. It's useful for decomposing a layout into hierarchical components for more efficient memory access patterns or to enable specialized tensor operations. Example:

```mojo
from layout import Layout, LayoutList, IntTuple
from layout.layout import hierarchical_unzip

fn main():
    # Create a layout to unzip
    var base = Layout.row_major(6, 8)
    var tilers = LayoutList()
    tilers.append(Layout(IntTuple(2, 2)))
    var result = hierarchical_unzip(base, tilers)
```

**Args:** * ​layout\_a (`Layout`): The layout to be unzipped. * ​tiler (`List[Layout]`): A list of layouts defining the unzipping patterns. **Returns:** A new layout representing the hierarchical unzipping with components from both the original layout and the tiler layouts. `hierarchical_unzip(layout_a: Layout, layout_b: Layout) -> Layout` Hierarchically unzips a layout according to another layout. This function creates a hierarchical layout by unzipping the first layout according to the second layout. It's a fundamental operation for decomposing a layout into hierarchical components, which enables more efficient memory access patterns for various tensor operations. Example:

```mojo
from layout import Layout, IntTuple
from layout.layout import hierarchical_unzip

fn main():
    # Create layouts
    var base = Layout.row_major(6, 8)
    var pattern = Layout(IntTuple(2, 2))
    var result = hierarchical_unzip(base, pattern)
```

**Args:** * ​layout\_a (`Layout`): The layout to be unzipped. * ​layout\_b (`Layout`): The layout defining the unzipping pattern. **Returns:** A new layout representing the hierarchical unzipping of layout\_a according to the pattern defined by layout\_b. --- ## hopper_matmul_tma_wgmma `hopper_matmul_tma_wgmma[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, wgmma_shape: IndexList[3], block_tile_shape: IndexList[3]](c_device: NDBuffer[c_type, 2, origin, c_shape], a_device: NDBuffer[a_type, 2, origin, a_shape], b_device: NDBuffer[b_type, 2, origin, b_shape], M: Int, N: Int, K: Int, ctx: DeviceContext)` --- ## hopper_matmul_tma_wgmma_kernel `hopper_matmul_tma_wgmma_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, transpose_b: Bool = True, promotion_frequency: Int = 1](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin])` --- ## host Implements the gpu host package. ## Modules * [​`constant_memory_mapping`](/mojo/stdlib/gpu/host/constant_memory_mapping/): This module provides functionality for mapping constant memory between host and device.
* [​`device_attribute`](/mojo/stdlib/gpu/host/device_attribute/): This module defines GPU device attributes that can be queried from CUDA-compatible devices. * [​`device_context`](/mojo/stdlib/gpu/host/device_context/): This module provides functionality for interacting with accelerators. In particular, the [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext) struct represents a single stream of execution on a given accelerator. You can use this struct to allocate accelerator memory, copy data to and from the accelerator, and compile and execute functions on the accelerator. * [​`dim`](/mojo/stdlib/gpu/host/dim/): This module implements the dim type. * [​`func_attribute`](/mojo/stdlib/gpu/host/func_attribute/): GPU Kernel Function Attributes Module * [​`info`](/mojo/stdlib/gpu/host/info/): Contains information about GPU architectures and their capabilities. * [​`launch_attribute`](/mojo/stdlib/gpu/host/launch_attribute/): GPU Launch Attributes for Kernel Execution Control --- ## HostBuffer `struct HostBuffer[type: DType]` Represents a block of host-resident storage. For GPU devices, a host buffer is allocated in the host's global memory. To allocate a `HostBuffer`, use one of the methods provided by `DeviceContext`, such as [`enqueue_create_host_buffer()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_create_host_buffer). ## Parameters * ​type (`DType`): Data type to be stored in the buffer. ## Implemented traits `AnyType`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a copy of an existing host buffer by incrementing its reference count. This copy constructor creates a new reference to the same underlying host buffer by incrementing the reference count of the native buffer object. Both the original and the copy will refer to the same memory. **Args:** * ​existing (`Self`): The host buffer to copy. ### `__moveinit__` `__moveinit__(out self, owned existing: Self)` Initializes this buffer by taking ownership of an existing buffer. This move constructor transfers ownership of the buffer from the existing instance to the new instance without incrementing the reference count. **Args:** * ​existing (`Self`): The buffer to move from, which will no longer be valid after this call. ### `__del__` `__del__(owned self)` Releases resources associated with this host buffer. This function schedules an owned buffer free using the stream in the device context. The actual deallocation may occur asynchronously after all operations using this buffer have completed. ### `__getitem__` `__getitem__(self, idx: Int) -> SIMD[type, 1]` Retrieves the element at the specified index from the host buffer. This operator allows direct access to individual elements in the host buffer using array indexing syntax. **Args:** * ​idx (`Int`): The index of the element to retrieve. **Returns:** The scalar value at the specified index. ### `__setitem__` `__setitem__(self, idx: Int, val: SIMD[type, 1])` Sets the element at the specified index in the host buffer. This operator allows direct modification of individual elements in the host buffer using array indexing syntax. **Args:** * ​idx (`Int`): The index of the element to modify. * ​val (`SIMD[type, 1]`): The new value to store at the specified index. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value.
### `__len__` `__len__(self) -> Int` Returns the number of elements in this buffer. This method calculates the number of elements by dividing the total byte size of the buffer by the size of each element. **Returns:** The number of elements in the buffer. ### `create_sub_buffer` `create_sub_buffer[view_type: DType](self, offset: Int, size: Int) -> HostBuffer[view_type]` Creates a sub-buffer view of this buffer with a different element type. This method creates a new buffer that references a subset of the memory in this buffer, potentially with a different element type. The sub-buffer shares the underlying memory with the original buffer. **Parameters:** * ​view\_type (`DType`): The data type for elements in the new sub-buffer. **Args:** * ​offset (`Int`): The starting offset in elements from the beginning of this buffer. * ​size (`Int`): The number of elements in the new sub-buffer. **Returns:** A new HostBuffer referencing the specified region with the specified element type. ### `enqueue_copy_to` `enqueue_copy_to(self, dst: Self)` Enqueues an asynchronous copy from this buffer to another host buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst (`Self`): The destination host buffer to copy data to. `enqueue_copy_to(self, dst: DeviceBuffer[type])` Enqueues an asynchronous copy from this buffer to a device buffer. This method schedules a memory copy operation from this buffer to the destination buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst (`DeviceBuffer[type]`): The destination device buffer to copy data to. `enqueue_copy_to(self, dst_ptr: UnsafePointer[SIMD[type, 1]])` Enqueues an asynchronous copy from this buffer to host memory. This method schedules a memory copy operation from this buffer to the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​dst\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the destination host memory location. ### `enqueue_copy_from` `enqueue_copy_from(self, src: Self)` Enqueues an asynchronous copy to this buffer from another host buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src (`Self`): The source host buffer to copy data from. `enqueue_copy_from(self, src: DeviceBuffer[type])` Enqueues an asynchronous copy to this buffer from a device buffer. This method schedules a memory copy operation to this buffer from the source buffer. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src (`DeviceBuffer[type]`): The source device buffer to copy data from. `enqueue_copy_from(self, src_ptr: UnsafePointer[SIMD[type, 1]])` Enqueues an asynchronous copy to this buffer from host memory. This method schedules a memory copy operation to this buffer from the specified host memory location. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​src\_ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to the source host memory location.
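To make the direction of these copy overloads concrete, here is a minimal round-trip sketch. It assumes an accelerator is present and uses the `DeviceContext` methods referenced above (`enqueue_create_host_buffer()` and `enqueue_create_buffer()`); treat it as illustrative rather than canonical:

```mojo
from gpu.host import DeviceContext

fn main() raises:
    var ctx = DeviceContext()
    # A host staging buffer and a device buffer of the same length.
    var host_buf = ctx.enqueue_create_host_buffer[DType.float32](16)
    var dev_buf = ctx.enqueue_create_buffer[DType.float32](16)
    ctx.synchronize()

    # Fill the host buffer with some data.
    for i in range(len(host_buf)):
        host_buf[i] = Float32(i)

    host_buf.enqueue_copy_to(dev_buf)    # host -> device
    host_buf.enqueue_copy_from(dev_buf)  # device -> host
    ctx.synchronize()
    print(host_buf)
```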
### `enqueue_fill` `enqueue_fill(self, val: SIMD[type, 1]) -> Self` Enqueues an operation to fill this buffer with a specified value. This method schedules a memory set operation that fills the entire buffer with the specified value. The operation is asynchronous and will be executed in the stream associated with this buffer's context. **Args:** * ​val (`SIMD[type, 1]`): The value to fill the buffer with. **Returns:** Self reference for method chaining. ### `reassign_ownership_to` `reassign_ownership_to(self, ctx: DeviceContext)` Transfers ownership of this buffer to another device context. This method changes the device context that owns this buffer. This can be useful when sharing buffers between different contexts or when migrating workloads between devices. **Args:** * ​ctx (`DeviceContext`): The new device context to take ownership of this buffer. ### `take_ptr` `take_ptr(owned self) -> UnsafePointer[SIMD[type, 1]]` Takes ownership of the underlying pointer from this buffer. This method releases the pointer from the buffer's control and returns it to the caller. After this call, the buffer no longer owns the pointer, and the caller is responsible for managing its lifecycle. **Returns:** The raw pointer that was owned by this buffer. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[type, 1]]` Returns the raw pointer without transferring ownership. This method provides direct access to the underlying pointer for advanced use cases. The buffer retains ownership of the pointer. **Returns:** The raw pointer owned by this buffer. ### `context` `context(self) -> DeviceContext` Returns the device context associated with this buffer. This method retrieves the device context that owns this buffer and is responsible for managing its lifecycle and operations. **Returns:** The device context associated with this buffer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of this buffer to the provided writer. This method formats the buffer's contents as a string and writes it to the specified writer. For large buffers, a compact representation is used. **Parameters:** * ​W (`Writer`): The writer type. **Args:** * ​writer (`W`): The writer to output the formatted string to. ### `__str__` `__str__(self) -> String` Returns a string representation of the `HostBuffer`. This method creates a human-readable string representation of the buffer's contents by formatting the elements. **Returns:** A string containing the formatted buffer contents. ### `as_span` `as_span(ref self) -> Span[SIMD[type, 1], self_is_origin]` Returns a `Span` pointing to the underlying memory of the `HostBuffer`. **Returns:** A `Span` pointing to the underlying memory of the `HostBuffer`. --- ## hypot `hypot[dtype: DType, width: Int, //](arg0: SIMD[dtype, width], arg1: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `hypot` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​arg0 (`SIMD[dtype, width]`): The first input argument. * ​arg1 (`SIMD[dtype, width]`): The second input argument. **Returns:** The `hypot` of the inputs. --- ## id This module provides GPU thread and block indexing functionality. It defines aliases and functions for accessing GPU grid, block, thread and cluster dimensions and indices.
These are essential primitives for GPU programming that allow code to determine its position and dimensions within the GPU execution hierarchy. Most functionality is architecture-agnostic, with some NVIDIA-specific features clearly marked. The module is designed to work seamlessly across different GPU architectures while providing optimal performance through hardware-specific optimizations where applicable. ## Aliases ### `block_dim` `alias block_dim = _BlockDim()` Contains the dimensions of the block as `x`, `y`, and `z` values (for example, `block_dim.y`) ### `block_id_in_cluster` `alias block_id_in_cluster = _Cluster_BlockIdx()` Contains the block id of the threadblock within a cluster, as `x`, `y`, and `z` values. ### `block_idx` `alias block_idx = _BlockIdx()` Contains the block index in the grid, as `x`, `y`, and `z` values. ### `cluster_dim` `alias cluster_dim = _ClusterDim()` Contains the dimensions of the cluster, as `x`, `y`, and `z` values. ### `cluster_idx` `alias cluster_idx = _ClusterIdx()` Contains the cluster index in the grid, as `x`, `y`, and `z` values. ### `global_idx` `alias global_idx = _GridIdx()` Contains the global offset of the kernel launch, as `x`, `y`, and `z` values. ### `grid_dim` `alias grid_dim = _GridDim()` Provides accessors for getting the `x`, `y`, and `z` dimensions of a grid. ### `thread_idx` `alias thread_idx = _ThreadIdx()` Contains the thread index in the block, as `x`, `y`, and `z` values. ## Functions * [​`lane_id`](/mojo/stdlib/gpu/id/lane_id): Returns the lane ID of the current thread within its warp. * [​`sm_id`](/mojo/stdlib/gpu/id/sm_id): Returns the Streaming Multiprocessor (SM) ID of the current thread. * [​`warp_id`](/mojo/stdlib/gpu/id/warp_id): Returns the warp ID of the current thread within its block. The warp ID is a unique identifier for each warp within a block, ranging from 0 to BLOCK\_SIZE/WARP\_SIZE-1. This ID is commonly used for warp-level programming and synchronization within a block. --- ## identifiable ## Traits * [​`Identifiable`](/mojo/stdlib/builtin/identifiable/Identifiable): The Identifiable trait denotes a type with an identity which can be compared with other instances of itself. * [​`TypeIdentifiable`](/mojo/stdlib/builtin/identifiable/TypeIdentifiable): Denotes a type that can be uniquely identified. --- ## Identifiable The Identifiable trait denotes a type with an identity which can be compared with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__is__` `__is__(self: _Self, rhs: _Self) -> Bool` Define whether `self` has the same identity as `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is `rhs`. ### `__isnot__` `__isnot__(self: _Self, rhs: _Self) -> Bool` Define whether `self` has a different identity than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is not `rhs`. --- ## identity `identity(x: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## IdentityScoreMod `@register_passable(trivial)` `struct IdentityScoreMod` IdentityScoreMod simply returns the attention score unchanged.
## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `ScoreModTrait`, `UnknownDestructibility` ## Aliases ### `name_str` `alias name_str = __init__[__mlir_type.!kgen.string]("no_pos")` ## Methods ### `score_mod` `score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int = 0) -> SIMD[type, width]` --- ## idx2crd `idx2crd(idx: IntTuple[origin], shape: IntTuple[origin]) -> IntTuple` Converts a linear index to a coordinate tuple within a given shape. This function splits an index into a coordinate within a Shape via a colexicographical enumeration of coordinates in Shape. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. **Returns:** A new `IntTuple` containing the coordinates corresponding to the linear index. `idx2crd(idx: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> IntTuple` Converts a linear index to a coordinate tuple within a given shape using custom strides. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. * ​\_stride (`IntTuple[origin]`): Custom strides to use for the conversion. **Returns:** A new `IntTuple` containing the coordinates corresponding to the linear index. --- ## idx2crd `idx2crd[: ImmutableOrigin, : ImmutableOrigin, : ImmutableOrigin, //, idx_t: IntTuple[$2], shape_t: IntTuple[$1], stride_t: IntTuple[$0]](idx: RuntimeTuple[idx_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type], stride: RuntimeTuple[stride_t, element_type=element_type]) -> RuntimeTuple[idx2crd[::Origin[::Bool(idx_t, shape_t, stride_t), element_type=element_type]` Converts a linear index to multi-dimensional coordinates. This function transforms a flat index into coordinate values based on the provided shape and stride information. This is essential for mapping linear memory accesses to multi-dimensional tensor elements. **Constraints:** The index must be a scalar value (not a tuple). **Parameters:** * ​idx\_t (`IntTuple[$2]`): IntTuple type of the index. * ​shape\_t (`IntTuple[$1]`): IntTuple type of the shape. * ​stride\_t (`IntTuple[$0]`): IntTuple type of the stride. **Args:** * ​idx (`RuntimeTuple[idx_t, element_type=element_type]`): The linear index to convert. * ​shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array. * ​stride (`RuntimeTuple[stride_t, element_type=element_type]`): The stride values for each dimension. **Returns:** A `RuntimeTuple` containing the multi-dimensional coordinates. `idx2crd[: ImmutableOrigin, : ImmutableOrigin, //, idx_t: IntTuple[$1], shape_t: IntTuple[$0]](idx: RuntimeTuple[idx_t, element_type=element_type], shape: RuntimeTuple[shape_t, element_type=element_type]) -> RuntimeTuple[idx2crd[::Origin[::Bool(idx_t, shape_t, prefix_product[::Origin[::Bool(shape_t)), element_type=element_type]` Converts a linear index to multi-dimensional coordinates using shape-derived strides. This is a convenience overload of `idx2crd` that automatically calculates the stride values from the shape using `prefix_product`. This is the common case for row-major storage order tensors. **Parameters:** * ​idx\_t (`IntTuple[$1]`): IntTuple type of the index. * ​shape\_t (`IntTuple[$0]`): IntTuple type of the shape. 
**Args:** * ​idx (`RuntimeTuple[idx_t, element_type=element_type]`): The linear index to convert. * ​shape (`RuntimeTuple[shape_t, element_type=element_type]`): The shape of the multi-dimensional array. **Returns:** A `RuntimeTuple` containing the multi-dimensional coordinates calculated using automatically derived strides from the shape. --- ## idx2crd2 `idx2crd2(idx: IntTuple[origin], shape: IntTuple[origin], _stride: IntTuple[origin]) -> IntTuple` Convert a linear index to coordinates. This function handles the actual conversion logic for different input combinations. Notes: * Handles four cases: tuple-tuple-tuple, tuple-int-int, int-tuple-tuple, and int-int-int. * When input shapes don't match, `abort()` will be called. **Args:** * ​idx (`IntTuple[origin]`): The linear index to convert. * ​shape (`IntTuple[origin]`): The shape of the tensor/array. * ​\_stride (`IntTuple[origin]`): Custom strides to use for the conversion. If empty, strides are computed from the shape using prefix\_product. **Returns:** A new IntTuple containing the coordinates corresponding to the linear index. --- ## image ## Structs * [​`Image2DLayout`](./Image2DLayout): * [​`ImageData`](./ImageData): Utility class that generalizes conv2d data and filter tensors with a given data layout. * [​`ImageShape`](./ImageShape): A data-layout agnostic representation of tensor shapes used in conv2d. * [​`PadHandling`](./PadHandling): --- ## Image2DLayout `@register_passable(trivial)` `struct Image2DLayout` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `FRSCf` `alias FRSCf = Image2DLayout(3)` ### `NCHW` `alias NCHW = Image2DLayout(1)` ### `NHWC` `alias NHWC = Image2DLayout(0)` ### `RSCF` `alias RSCF = Image2DLayout(2)` ### `UNKNOWN` `alias UNKNOWN = Image2DLayout(-1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## ImageData `@register_passable(trivial)` `struct ImageData[shape: DimList, type: DType, static_layout: Image2DLayout, origin: MutableOrigin]` Utility class that generalizes conv2d data and filter tensors with a given data layout. ## Fields * ​data (`NDBuffer[type, 4, origin, shape]`): * ​dynamic\_layout (`Image2DLayout`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(data: NDBuffer[type, 4, origin, shape], layout: Image2DLayout) -> Self` Constructs an image data instance with a dynamic layout parameter. **Args:** * ​data (`NDBuffer[type, 4, origin, shape]`): A 4d buffer containing the actual data. * ​layout (`Image2DLayout`): Data layout tag. `@implicit` `__init__(data: NDBuffer[type, 4, origin, shape]) -> Self` ### `__getitem__` `__getitem__(self, n: Int, c: Int, h: Int, w: Int) -> SIMD[type, 1]` Reads the underlying data buffer based on the tensor index and underlying data layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension. * ​w (`Int`): Index on the width dimension. **Returns:** The value stored at the given index position. ### `__setitem__` `__setitem__(self, n: Int, c: Int, h: Int, w: Int, value: SIMD[type, 1])` Writes the underlying data buffer based on the tensor index and underlying data layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension. * ​w (`Int`): Index on the width dimension.
* ​value (`SIMD[type, 1]`): The value to store at the given index position. ### `to_static_layout` `to_static_layout[new_static_layout: Image2DLayout](self) -> ImageData[shape, type, new_static_layout, origin]` Conversion utility from a fully dynamic data structure (e.g. from a C shim) to one with a compile-time-known data layout. **Returns:** The image data with static data layout. ### `get_layout` `get_layout(self) -> Image2DLayout` The getter function of the underlying data layout, resolving from either statically or dynamically provided information. **Returns:** The resolved data layout tag for this image instance. ### `get_flat_index` `get_flat_index(self, n: Int, c: Int, h: Int, w: Int) -> Int` Converts the dimension index to the flat index of the underlying data based on the tensor layout. **Args:** * ​n (`Int`): Index on the batch dimension. * ​c (`Int`): Index on the channel dimension. * ​h (`Int`): Index on the height dimension. * ​w (`Int`): Index on the width dimension. **Returns:** An integer containing the index based on the underlying data layout. ### `get_tuple_index` `get_tuple_index(self, idx: Int) -> IndexList[4]` Converts the flat index to the dimension index of the underlying data based on the tensor layout. **Args:** * ​idx (`Int`): Flat index. **Returns:** An IndexList containing the index in NCHW order. ### `num_elements` `num_elements(self) -> Int` --- ## ImageShape `@register_passable(trivial)` `struct ImageShape` A data-layout agnostic representation of tensor shapes used in conv2d. ## Fields * ​N (`Int`): * ​C (`Int`): * ​H (`Int`): * ​W (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__[shape: DimList, type: DType, layout: Image2DLayout](image_data: ImageData[shape, type, layout, origin]) -> Self` Constructor of an ImageShape instance from an ImageData. **Args:** * ​image\_data (`ImageData[shape, type, layout, origin]`): The image data instance to extract shape info from. --- ## implicitarg_ptr `implicitarg_ptr() -> UnsafePointer[SIMD[uint8, 1], address_space=AddressSpace(4)]` Get a pointer to AMD's implicit arguments table. **Returns:** A pointer to AMD's implicit arguments table. --- ## ImplicitlyBoolable The `ImplicitlyBoolable` trait describes a type that can be implicitly converted to a `Bool`. Types conforming to this trait can be passed to a function that expects a `Bool` without explicitly converting to it. Accordingly, most types should conform to `Boolable` instead, since implicit conversions to `Bool` can have unintuitive consequences. This trait requires the type to implement the `__as_bool__()` method. For example:

```mojo
struct Foo(ImplicitlyBoolable):
    var val: Bool

    fn __as_bool__(self) -> Bool:
        return self.val

    fn __bool__(self) -> Bool:
        return self.__as_bool__()
```

## Implemented traits `AnyType`, `Boolable`, `UnknownDestructibility` ## Methods ### `__bool__` `__bool__(self: _Self) -> Bool` Get the boolean representation of the value. **Returns:** The boolean representation of the value. ### `__as_bool__` `__as_bool__(self: _Self) -> Bool` Get the boolean representation of the value. **Returns:** The boolean representation of the value. --- ## ImplicitlyIntable The `ImplicitlyIntable` trait describes a type that can be converted to an Int implicitly. This trait requires the type to implement the `__as_int__()` method.
For example: ```mojo struct Foo(ImplicitlyIntable): var i: Int fn __int__(self) -> Int: return self.i fn __as_int__(self) -> Int: return self.__int__() ``` Now you can use `Foo` anywhere that an `Int` is expected, e.g. equality checks: ```mojo foo = Foo(42) assert_equal(Int(42), foo) ``` ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__as_int__` `__as_int__(self: _Self) -> Int` Implicitly convert to an integral representation of the value, wherever an `Int` is expected. **Returns:** The integral representation of the value. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. --- ## index `index[T: Indexer](idx: T, /) -> index` Returns the value of `__index__` for the given value. **Parameters:** * ​T (`Indexer`): A type conforming to the `Indexer` trait. **Args:** * ​idx (`T`): The value. **Returns:** An `__mlir_type` representing the index value. --- ## index Implements `IndexList` which is commonly used to represent N-D indices. You can import these APIs from the `utils` package. For example: ```mojo from utils import IndexList ``` ## Structs * [​`IndexList`](/mojo/stdlib/utils/index_/IndexList): A base struct that implements size agnostic index functions. ## Functions * [​`Index`](/mojo/stdlib/utils/index_/Index-function): Constructs a 1-D Index from the given value. * [​`product`](/mojo/stdlib/utils/index_/product): Computes a product of values in the tuple up to the given index. --- ## Index `Index[T0: Intable, //, *, dtype: DType = int64](x: T0) -> IndexList[1, element_type=dtype]` Constructs a 1-D Index from the given value. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The initial value. **Returns:** The constructed IndexList. `Index[*, dtype: DType = int64](x: UInt) -> IndexList[1, element_type=dtype]` Constructs a 1-D Index from the given value. **Parameters:** * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`UInt`): The initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, //, *, dtype: DType = int64](x: T0, y: T1) -> IndexList[2, element_type=dtype]` Constructs a 2-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. **Returns:** The constructed IndexList. `Index[*, dtype: DType = int64](x: UInt, y: UInt) -> IndexList[2, element_type=dtype]` Constructs a 2-D Index from the given values. **Parameters:** * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`UInt`): The 1st initial value. * ​y (`UInt`): The 2nd initial value. **Returns:** The constructed IndexList. 
`Index[T0: Intable, T1: Intable, T2: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2) -> IndexList[3, element_type=dtype]` Constructs a 3-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, T2: Intable, T3: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2, w: T3) -> IndexList[4, element_type=dtype]` Constructs a 4-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​T3 (`Intable`): The type of the 4th argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. * ​w (`T3`): The 4th initial value. **Returns:** The constructed IndexList. `Index[T0: Intable, T1: Intable, T2: Intable, T3: Intable, T4: Intable, //, *, dtype: DType = int64](x: T0, y: T1, z: T2, w: T3, v: T4) -> IndexList[5, element_type=dtype]` Constructs a 5-D Index from the given values. **Parameters:** * ​T0 (`Intable`): The type of the 1st argument. * ​T1 (`Intable`): The type of the 2nd argument. * ​T2 (`Intable`): The type of the 3rd argument. * ​T3 (`Intable`): The type of the 4th argument. * ​T4 (`Intable`): The type of the 5th argument. * ​dtype (`DType`): The integer type of the underlying element. **Args:** * ​x (`T0`): The 1st initial value. * ​y (`T1`): The 2nd initial value. * ​z (`T2`): The 3rd initial value. * ​w (`T3`): The 4th initial value. * ​v (`T4`): The 5th initial value. **Returns:** The constructed IndexList. --- ## index_tensor ## Functions * [​`advanced_indexing_getitem`](./advanced_indexing_getitem): Implement basic numpy-style advanced indexing. * [​`advanced_indexing_getitem_shape`](./advanced_indexing_getitem_shape): Calculate the output shape from advanced indexing. * [​`advanced_indexing_setitem_inplace`](./advanced_indexing_setitem_inplace): Implement basic numpy-style advanced indexing with assignment. * [​`index_tensor`](./index_tensor): Index\_tensor operation; based on modified implementation of gather\_nd. * [​`index_tensor_shape`](./index_tensor_shape): Compute the output shape of a `index_tensor` operation, and assert the inputs are compatible. --- ## index_tensor `index_tensor[type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, output_rank: Int, batch_dims: Int, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), single_thread_blocking_override: Bool = False](data: NDBuffer[type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], output: NDBuffer[type, output_rank, origin], ctx: DeviceContextPtr)` Index\_tensor operation; based on modified implementation of gather\_nd. **Parameters:** * ​type (`DType`): Type of data tensor. * ​indices\_type (`DType`): Type of indices tensor. * ​data\_rank (`Int`): Rank of data tensor (data\_rank >= 1). * ​indices\_rank (`Int`): Rank of indices tensor (indices\_rank >= 1). * ​output\_rank (`Int`): Rank of output tensor. * ​batch\_dims (`Int`): Number of batch dimensions. 
The gather of indexing starts from dimension of data\[batch\_dims:]. * ​target (`StringSlice[StaticConstantOrigin]`): The target architecture to execute on. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​data (`NDBuffer[type, data_rank, origin]`): Tensor of rank data\_rank >= 1. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank >= 1. All index values are expected to be within bounds \[-s, s-1] along axis of size s. It is an error if any of the index values are out of bounds. * ​output (`NDBuffer[type, output_rank, origin]`): Tensor of rank data\_rank + indices\_rank - indices\_shape\[-1] - 1 - b. * ​ctx (`DeviceContextPtr`): The DeviceContextPtr as prepared by the graph compiler. --- ## index_tensor_shape `index_tensor_shape[input_rank: Int, indices_rank: Int, output_rank: Int, input_type: DType, indices_type: DType, batch_dims: Int, single_thread_blocking_override: Bool = True](input_buf: NDBuffer[input_type, input_rank, origin], indices_buf: NDBuffer[indices_type, indices_rank, origin]) -> IndexList[output_rank]` Compute the output shape of a `index_tensor` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​output\_rank (`Int`): Rank of the output tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​batch\_dims (`Int`): Batch dimensions. * ​single\_thread\_blocking\_override (`Bool`): If True, then reduction is run synchronously using a single thread. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​indices\_buf (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. **Returns:** The output shape. --- ## Indexer The `Indexer` trait is used for types that can index into a collection or pointer. The type returned is the underlying \_\_mlir\_type.index, enabling types like `UInt` to not have to be converted to an `Int` first. This type is implicitly convertable to an `Int`, so can be used anywhere an `Int` can e.g. for comparisons. ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__index__` `__index__(self: _Self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. --- ## IndexList `@register_passable(trivial)` `struct IndexList[size: Int, *, element_type: DType = int64]` A base struct that implements size agnostic index functions. ## Parameters * ​size (`Int`): The size of the tuple. * ​element\_type (`DType`): The underlying dtype of the integer element value. ## Fields * ​data (`StaticTuple[SIMD[element_type, 1], size]`): The underlying storage of the tuple value. 
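To make the struct concrete before the trait and method listings below, here is a minimal usage sketch. It is illustrative only: it relies on the constructors, indexing, arithmetic, and printing behavior documented in this section, and assumes the `Index` helper is importable from `utils` alongside `IndexList`, as the module overview above suggests.

```mojo
from utils import Index, IndexList

fn main():
    # Variadic constructor; the element dtype defaults to int64.
    var shape = IndexList[3](2, 4, 8)
    shape[0] = 16      # runtime __setitem__
    print(shape)       # prints the tuple's elements, e.g. (16, 4, 8)
    print(len(shape))  # 3

    # Index() builds small IndexLists; arithmetic is element-wise.
    var ij = Index(1, 2) + Index(10, 20)
    print(ij)          # (11, 22)
```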
## Implemented traits `AnyType`, `Comparable`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Methods ### `__init__` `__init__() -> Self` Constructs a static int tuple of the given size. `@implicit` `__init__(data: StaticTuple[SIMD[element_type, 1], size]) -> Self` Constructs a static int tuple of the given size. **Args:** * ​data (`StaticTuple[SIMD[element_type, 1], size]`): The StaticTuple to construct the IndexList from. `@implicit` `__init__(elems: Tuple[Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. **Args:** * ​elems (`Tuple[Int, Int]`): The tuple to copy from. `@implicit` `__init__(elems: Tuple[Int, Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. **Args:** * ​elems (`Tuple[Int, Int, Int]`): The tuple to copy from. `@implicit` `__init__(elems: Tuple[Int, Int, Int, Int]) -> Self` Constructs a static int tuple given a tuple of integers. **Args:** * ​elems (`Tuple[Int, Int, Int, Int]`): The tuple to copy from. `@implicit` `__init__(*elems: Int, *, __list_literal__: Tuple[] = Tuple()) -> Self` Constructs a static int tuple given a set of arguments. **Args:** * ​\*elems (`Int`): The elements to construct the tuple. * ​**list\_literal** (`Tuple[]`): Specifies that this constructor can be used for list literals. `@implicit` `__init__(elem: Int) -> Self` Constructs a static int tuple given a set of arguments. **Args:** * ​elem (`Int`): The elem to splat into the tuple. `__init__(*, other: Self) -> Self` Copy constructor. **Args:** * ​other (`Self`): The other tuple to copy from. `@implicit` `__init__(values: VariadicList[Int]) -> Self` Creates a tuple constant using the specified values. **Args:** * ​values (`VariadicList[Int]`): The list of values. ### `__getitem__` `__getitem__[idx: Int](self) -> Int` Gets an element from the tuple by index. **Parameters:** * ​idx (`Int`): The element index. **Returns:** The tuple element value. `__getitem__[I: Indexer](self, idx: I) -> Int` Gets an element from the tuple by index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The element index. **Returns:** The tuple element value. ### `__setitem__` `__setitem__[idx: Int](mut self, val: Int)` Sets an element in the tuple at the given static index. **Parameters:** * ​idx (`Int`): The element index. **Args:** * ​val (`Int`): The value to store. `__setitem__[idx: Int](mut self, val: SIMD[element_type, 1])` Sets an element in the tuple at the given static index. **Parameters:** * ​idx (`Int`): The element index. **Args:** * ​val (`SIMD[element_type, 1]`): The value to store. `__setitem__(mut self, idx: Int, val: Int)` Sets an element in the tuple at the given index. **Args:** * ​idx (`Int`): The element index. * ​val (`Int`): The value to store. ### `__lt__` `__lt__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using LT comparison. A tuple is less-than another tuple if all corresponding elements of lhs is less than rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__le__` `__le__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using LE comparison. A tuple is less-or-equal than another tuple if all corresponding elements of lhs is less-or-equal than rhs. 
Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__eq__` `__eq__(self, rhs: Self) -> Bool` Compares this tuple to another tuple for equality. The tuples are equal if all corresponding elements are equal. **Args:** * ​rhs (`Self`): The other tuple. **Returns:** The comparison result. ### `__ne__` `__ne__(self, rhs: Self) -> Bool` Compares this tuple to another tuple for non-equality. The tuples are non-equal if at least one element of LHS isn't equal to the corresponding element from RHS. **Args:** * ​rhs (`Self`): The other tuple. **Returns:** The comparison result. ### `__gt__` `__gt__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using GT comparison. A tuple is greater-than than another tuple if all corresponding elements of lhs is greater-than than rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__ge__` `__ge__(self, rhs: Self) -> Bool` Compares this tuple to another tuple using GE comparison. A tuple is greater-or-equal than another tuple if all corresponding elements of lhs is greater-or-equal than rhs. Note: This is **not** a lexical comparison. **Args:** * ​rhs (`Self`): Right hand side tuple. **Returns:** The comparison result. ### `__add__` `__add__(self, rhs: Self) -> Self` Performs element-wise integer add. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Performs element-wise integer subtract. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Performs element-wise integer multiply. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Performs element-wise integer floor division. **Args:** * ​rhs (`Self`): The elementwise divisor. **Returns:** The resulting index tuple. ### `__rfloordiv__` `__rfloordiv__(self, rhs: Self) -> Self` Floor divides rhs by this object. **Args:** * ​rhs (`Self`): The value to elementwise divide by self. **Returns:** The resulting index tuple. ### `__len__` `__len__(self) -> Int` Returns the size of the tuple. **Returns:** The tuple size. ### `as_tuple` `as_tuple(self) -> StaticTuple[Int, size]` Converts this IndexList to StaticTuple. **Returns:** The corresponding StaticTuple object. ### `canonicalize` `canonicalize(self) -> IndexList[size]` Canonicalizes the IndexList. **Returns:** Canonicalizes the object. ### `flattened_length` `flattened_length(self) -> Int` Returns the flattened length of the tuple. **Returns:** The flattened length of the tuple. ### `remu` `remu(self, rhs: Self) -> Self` Performs element-wise integer unsigned modulo. **Args:** * ​rhs (`Self`): Right hand side operand. **Returns:** The resulting index tuple. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this IndexList value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__str__` `__str__(self) -> String` Get the tuple as a string. **Returns:** A string representation. ### `cast` `cast[dtype: DType](self) -> IndexList[size, element_type=dtype]` Casts to the target DType. **Parameters:** * ​dtype (`DType`): The dtype to cast towards. **Returns:** The list casted to the target type. 
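Since the "not a lexical comparison" notes above are easy to skim past, here is a small illustrative sketch: with element-wise semantics, two index lists can compare `False` in both directions.

```mojo
from utils import IndexList

fn main():
    var a = IndexList[2](1, 9)
    var b = IndexList[2](2, 5)
    # a < b requires EVERY element of a to be less than the matching
    # element of b, so this is False (9 >= 5)...
    print(a < b)  # False
    # ...and b < a is also False (2 >= 1 fails at the first element).
    print(b < a)  # False
    # cast changes only the element dtype, not the values.
    print(a.cast[DType.int32]())
```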
### `__hash__` `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. --- ## inf `inf[dtype: DType]() -> SIMD[dtype, 1]` Gets a +inf value for the given dtype. **Constraints:** Can only be used for FP dtypes. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The +inf value of the given dtype. --- ## info Contains information about GPU architectures and their capabilities. This module provides detailed specifications for various GPU models including NVIDIA and AMD GPUs. It includes information about compute capabilities, memory specifications, thread organization, and performance characteristics. ## Aliases ### `A10` `alias A10 = Info(__init__[__mlir_type.!kgen.string]("A10"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.5999999999999996), __init__[__mlir_type.!kgen.string]("sm_86"), 72, 32, 1536, 32, 64, 2048, 32, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 16, 128, 4, 1024)` ### `A100` `alias A100 = Info(__init__[__mlir_type.!kgen.string]("A100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8), __init__[__mlir_type.!kgen.string]("sm_80"), 108, 32, 2048, 32, 64, 2048, 32, 167936, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `B100` `alias B100 = Info(__init__[__mlir_type.!kgen.string]("B100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](10), __init__[__mlir_type.!kgen.string]("sm_100a"), 132, 32, -1, 32, 64, 1536, 32, 262144, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `B200` `alias B200 = Info(__init__[__mlir_type.!kgen.string]("B200"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](10), __init__[__mlir_type.!kgen.string]("sm_100a"), 148, 32, -1, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `DEFAULT_GPU` `alias DEFAULT_GPU = from_name[::StringSlice[::Bool()` ### `DEFAULT_GPU_ARCH` `alias DEFAULT_GPU_ARCH = _accelerator_arch()` ### `DEFAULT_GPU_TARGET` `alias DEFAULT_GPU_TARGET = from_name[::StringSlice[::Bool().target()` ### `H100` `alias H100 = Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ### `L4` `alias L4 = Info(__init__[__mlir_type.!kgen.string]("L4"), 
Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 58, 32, 1536, 32, 64, 2048, 32, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `MI300X` `alias MI300X = Info(__init__[__mlir_type.!kgen.string]("MI300X"), Vendor(__init__[__mlir_type.!pop.int_literal](1)), __init__[__mlir_type.!kgen.string]("hip"), __init__[__mlir_type.!kgen.string]("gfx942"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.float_literal](9.4000000000000003), __init__[__mlir_type.!kgen.string]("CDNA3"), 304, 64, 2048, 64, 32, 2048, 2, 65536, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 2, 128, 4, 1024)` ### `NoGPU` `alias NoGPU = Info(__init__[__mlir_type.!kgen.string]("NoGPU"), Vendor(__init__[__mlir_type.!pop.int_literal](0)), __init__[__mlir_type.!kgen.string]("none"), __init__[__mlir_type.!kgen.string]("no_gpu"), __init__[__mlir_type.!kgen.string](""), __init__[__mlir_type.!pop.int_literal](0), __init__[__mlir_type.!kgen.string](""), 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, __init__[__mlir_type.!kgen.string]("none"), 0, 0, 0, 0, 0, 0)` ### `OrinNano` `alias OrinNano = Info(__init__[__mlir_type.!kgen.string]("Orin Nano"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ampere"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.6999999999999993), __init__[__mlir_type.!kgen.string]("sm_87"), 8, 32, 1536, 32, 64, 2048, 32, 167936, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 16, 128, 4, 1024)` ### `RTX2060` `alias RTX2060 = Info(__init__[__mlir_type.!kgen.string]("RTX2060"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("turing"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](7.5), __init__[__mlir_type.!kgen.string]("sm_75"), 30, 32, 2048, 32, 64, 2048, 16, 65536, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 32768, 16, 32, 4, 1024)` ### `RTX4090` `alias RTX4090 = Info(__init__[__mlir_type.!kgen.string]("RTX4090"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada lovelace"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 128, 32, -1, 32, 64, 1536, 24, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `RTX4090m` `alias RTX4090m = Info(__init__[__mlir_type.!kgen.string]("RTX4090m"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("ada lovelace"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](8.9000000000000004), __init__[__mlir_type.!kgen.string]("sm_89"), 76, 32, -1, 32, 64, 1536, 24, 102400, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 24, 128, 4, 1024)` ### `RTX5090` `alias RTX5090 = Info(__init__[__mlir_type.!kgen.string]("RTX5090"), 
Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("blackwell"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](12), __init__[__mlir_type.!kgen.string]("sm_120a"), 170, 32, -1, 32, 64, 1536, 32, 59392, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)` ## Structs * [​`Info`](/mojo/stdlib/gpu/host/info/Info): Comprehensive information about a GPU architecture. * [​`Vendor`](/mojo/stdlib/gpu/host/info/Vendor): Represents GPU vendors. ## Functions * [​`is_cpu`](/mojo/stdlib/gpu/host/info/is_cpu): Checks if the target is a CPU (compile-time version). * [​`is_gpu`](/mojo/stdlib/gpu/host/info/is_gpu): Checks if the target is a GPU (compile-time version). * [​`is_valid_target`](/mojo/stdlib/gpu/host/info/is_valid_target): Checks if the target is valid (compile-time version). --- ## info Implements methods for querying the host target info. You can import these APIs from the `sys` package. For example: ```mojo from sys import CompilationTarget print(CompilationTarget.is_x86()) ``` ## Structs * [​`CompilationTarget`](/mojo/stdlib/sys/info/CompilationTarget): A struct that provides information about a target architecture. ## Functions * [​`alignof`](/mojo/stdlib/sys/info/alignof): Returns the align of (in bytes) of the type. * [​`bitwidthof`](/mojo/stdlib/sys/info/bitwidthof): Returns the size of (in bits) of the type. * [​`has_accelerator`](/mojo/stdlib/sys/info/has_accelerator): Returns True if the host system has an accelerator and False otherwise. * [​`has_amd_gpu_accelerator`](/mojo/stdlib/sys/info/has_amd_gpu_accelerator): Returns True if the host system has an AMD GPU and False otherwise. * [​`has_avx`](/mojo/stdlib/sys/info/has_avx): Returns True if the host system has AVX, otherwise returns False. * [​`has_avx2`](/mojo/stdlib/sys/info/has_avx2): Returns True if the host system has AVX2, otherwise returns False. * [​`has_avx512f`](/mojo/stdlib/sys/info/has_avx512f): Returns True if the host system has AVX512, otherwise returns False. * [​`has_fma`](/mojo/stdlib/sys/info/has_fma): Returns True if the host system has FMA (Fused Multiply-Add) support, otherwise returns False. * [​`has_intel_amx`](/mojo/stdlib/sys/info/has_intel_amx): Returns True if the host system has Intel AMX support, otherwise returns False. * [​`has_neon`](/mojo/stdlib/sys/info/has_neon): Returns True if the host system has Neon support, otherwise returns False. * [​`has_neon_int8_dotprod`](/mojo/stdlib/sys/info/has_neon_int8_dotprod): Returns True if the host system has the Neon int8 dot product extension, otherwise returns False. * [​`has_neon_int8_matmul`](/mojo/stdlib/sys/info/has_neon_int8_matmul): Returns True if the host system has the Neon int8 matrix multiplication extension (I8MM), otherwise returns False. * [​`has_nvidia_gpu_accelerator`](/mojo/stdlib/sys/info/has_nvidia_gpu_accelerator): Returns True if the host system has an NVIDIA GPU and False otherwise. * [​`has_sse4`](/mojo/stdlib/sys/info/has_sse4): Returns True if the host system has sse4, otherwise returns False. * [​`has_vnni`](/mojo/stdlib/sys/info/has_vnni): Returns True if the host system has avx512\_vnni, otherwise returns False. * [​`is_32bit`](/mojo/stdlib/sys/info/is_32bit): Returns True if the maximum integral value is 32 bit. * [​`is_64bit`](/mojo/stdlib/sys/info/is_64bit): Returns True if the maximum integral value is 64 bit. 
* [​`is_amd_gpu`](/mojo/stdlib/sys/info/is_amd_gpu): Returns True if the target triple of the compiler is `amdgcn-amd-amdhsa` False otherwise. * [​`is_apple_m1`](/mojo/stdlib/sys/info/is_apple_m1): Returns True if the host system is an Apple M1 with AMX support, otherwise returns False. * [​`is_apple_m2`](/mojo/stdlib/sys/info/is_apple_m2): Returns True if the host system is an Apple M2 with AMX support, otherwise returns False. * [​`is_apple_m3`](/mojo/stdlib/sys/info/is_apple_m3): Returns True if the host system is an Apple M3 with AMX support, otherwise returns False. * [​`is_apple_m4`](/mojo/stdlib/sys/info/is_apple_m4): Returns True if the host system is an Apple M4 with AMX support, otherwise returns False. * [​`is_apple_silicon`](/mojo/stdlib/sys/info/is_apple_silicon): Returns True if the host system is an Apple Silicon with AMX support, otherwise returns False. * [​`is_big_endian`](/mojo/stdlib/sys/info/is_big_endian): Returns True if the host endianness is big and False otherwise. * [​`is_gpu`](/mojo/stdlib/sys/info/is_gpu): Returns True if the target triple is GPU and False otherwise. * [​`is_little_endian`](/mojo/stdlib/sys/info/is_little_endian): Returns True if the host endianness is little and False otherwise. * [​`is_neoverse_n1`](/mojo/stdlib/sys/info/is_neoverse_n1): Returns True if the host system is a Neoverse N1 system, otherwise returns False. * [​`is_nvidia_gpu`](/mojo/stdlib/sys/info/is_nvidia_gpu): Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` False otherwise. * [​`is_triple`](/mojo/stdlib/sys/info/is_triple): Returns True if the target triple of the compiler matches the input and False otherwise. * [​`is_x86`](/mojo/stdlib/sys/info/is_x86): Returns True if the host system architecture is X86 and False otherwise. * [​`num_logical_cores`](/mojo/stdlib/sys/info/num_logical_cores): Returns the number of hardware threads, including hyperthreads across all CPU sockets. * [​`num_performance_cores`](/mojo/stdlib/sys/info/num_performance_cores): Returns the number of physical performance cores across all CPU sockets. If not known, returns the total number of physical cores. * [​`num_physical_cores`](/mojo/stdlib/sys/info/num_physical_cores): Returns the number of physical cores across all CPU sockets. * [​`os_is_linux`](/mojo/stdlib/sys/info/os_is_linux): Returns True if the host operating system is Linux. * [​`os_is_macos`](/mojo/stdlib/sys/info/os_is_macos): Returns True if the host operating system is macOS. * [​`os_is_windows`](/mojo/stdlib/sys/info/os_is_windows): Returns True if the host operating system is Windows. * [​`simdbitwidth`](/mojo/stdlib/sys/info/simdbitwidth): Returns the vector size (in bits) of the specified target. * [​`simdbytewidth`](/mojo/stdlib/sys/info/simdbytewidth): Returns the vector size (in bytes) of the specified target. * [​`simdwidthof`](/mojo/stdlib/sys/info/simdwidthof): Returns the vector size of the type on the host system. * [​`sizeof`](/mojo/stdlib/sys/info/sizeof): Returns the size of (in bytes) of the type. --- ## Info `@register_passable(trivial)` `struct Info[func_type: AnyTrivialRegType, func: func_type, target: target]` Contains compilation information and results for a function. Stores assembly/IR code, function metadata, and error information from compiling a function. Attributes: populate: Function to populate captures ## Parameters * ​func\_type (`AnyTrivialRegType`): Type of the function being compiled. * ​func (`func_type`): The function being compiled. 
* ​target (`target`): The target architecture to compile for. ## Fields * ​asm (`StringSlice[StaticConstantOrigin]`): Generated assembly/IR code from the compilation process. * ​function\_name (`StringSlice[StaticConstantOrigin]`): Mangled name of the compiled function, used for symbol resolution. * ​module\_name (`StringSlice[StaticConstantOrigin]`): Name of the module containing the compiled function. * ​num\_captures (`Int`): Number of variables captured by the function closure. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `populate` `alias populate = rebind[AnyTrivialRegType,AnyTrivialRegType](compile_offload_closure(target, :!kgen.param func))` Function pointer to populate captured variables in the function closure. ## Methods ### `__contains__` `__contains__(self, content: String) -> Bool` Checks if content exists in the assembly/IR. **Args:** * ​content (`String`): String to search for. **Returns:** True if content is found, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the assembly/IR to a writer. **Parameters:** * ​W (`Writer`): Type that implements the Writer interface for writing data. **Args:** * ​writer (`W`): Writer object to write the assembly to. ### `__str__` `__str__(self) -> String` Converts the assembly/IR to a string. **Returns:** The assembly/IR as a string. ### `write_text` `write_text[path_like: PathLike](self, path: path_like)` Writes the assembly/IR to a file. **Parameters:** * ​path\_like (`PathLike`): Type that implements the `PathLike` interface for file path representation. **Args:** * ​path (`path_like`): Path to write the file to. **Raises:** If file writing operations fail. --- ## Info `@register_passable` `struct Info` Comprehensive information about a GPU architecture. This struct contains detailed specifications about GPU capabilities, including compute units, memory, thread organization, and performance characteristics. ## Fields * ​name (`StringSlice[StaticConstantOrigin]`): The model name of the GPU. * ​vendor (`Vendor`): The vendor/manufacturer of the GPU (e.g., NVIDIA, AMD). * ​api (`StringSlice[StaticConstantOrigin]`): The graphics/compute API supported by the GPU (e.g., CUDA, ROCm). * ​arch\_name (`StringSlice[StaticConstantOrigin]`): The architecture name of the GPU (e.g., sm\_80, gfx942). * ​compile\_options (`StringSlice[StaticConstantOrigin]`): Compiler options specific to this GPU architecture. * ​compute (`SIMD[float32, 1]`): Compute capability version number for NVIDIA GPUs. * ​version (`StringSlice[StaticConstantOrigin]`): Version string of the GPU architecture. * ​sm\_count (`Int`): Number of streaming multiprocessors (SMs) on the GPU. * ​warp\_size (`Int`): Number of threads in a warp/wavefront. * ​threads\_per\_sm (`Int`): Maximum number of threads per streaming multiprocessor. * ​threads\_per\_warp (`Int`): Number of threads that execute together in a warp/wavefront. * ​warps\_per\_multiprocessor (`Int`): Maximum number of warps that can be active on a multiprocessor. * ​threads\_per\_multiprocessor (`Int`): Maximum number of threads that can be active on a multiprocessor. * ​thread\_blocks\_per\_multiprocessor (`Int`): Maximum number of thread blocks that can be active on a multiprocessor. * ​shared\_memory\_per\_multiprocessor (`Int`): Size of shared memory available per multiprocessor in bytes. * ​register\_file\_size (`Int`): Total size of the register file per multiprocessor in bytes. 
* ​register\_allocation\_unit\_size (`Int`): Minimum allocation size for registers in bytes. * ​allocation\_granularity (`StringSlice[StaticConstantOrigin]`): Description of how resources are allocated on the GPU. * ​max\_registers\_per\_thread (`Int`): Maximum number of registers that can be allocated to a single thread. * ​max\_registers\_per\_block (`Int`): Maximum number of registers that can be allocated to a thread block. * ​max\_blocks\_per\_multiprocessor (`Int`): Maximum number of blocks that can be scheduled on a multiprocessor. * ​shared\_memory\_allocation\_unit\_size (`Int`): Minimum allocation size for shared memory in bytes. * ​warp\_allocation\_granularity (`Int`): Granularity at which warps are allocated resources. * ​max\_thread\_block\_size (`Int`): Maximum number of threads allowed in a thread block. ## Implemented traits `AnyType`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__lt__` `__lt__(self, other: Self) -> Bool` Compares if this GPU has lower compute capability than another. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has lower compute capability, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Compares if this GPU has lower or equal compute capability. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has lower or equal compute capability. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two GPU Info instances represent the same GPU model. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if both instances represent the same GPU model. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two GPU Info instances represent different GPU models. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if instances represent different GPU models. ### `__gt__` `__gt__(self, other: Self) -> Bool` Compares if this GPU has higher compute capability than another. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has higher compute capability, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Compares if this GPU has higher or equal compute capability. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if this GPU has higher or equal compute capability. ### `__is__` `__is__(self, other: Self) -> Bool` Identity comparison operator for GPU Info instances. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if both instances represent the same GPU model. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Negative identity comparison operator for GPU Info instances. **Args:** * ​other (`Self`): Another GPU Info instance to compare against. **Returns:** True if instances represent different GPU models. ### `target` `target(self) -> target` Gets the MLIR target configuration for this GPU. **Returns:** MLIR target configuration for the GPU. ### `from_target` `static from_target[target: target]() -> Self` Creates an Info instance from an MLIR target. **Parameters:** * ​target (`target`): MLIR target configuration. **Returns:** GPU info corresponding to the target. ### `from_name` `static from_name[name: StringSlice[StaticConstantOrigin]]() -> Self` Creates an Info instance from a GPU architecture name. 
**Parameters:** * ​name (`StringSlice[StaticConstantOrigin]`): GPU architecture name (e.g., "sm\_80", "gfx942"). **Returns:** GPU info corresponding to the architecture name. ### `occupancy` `occupancy(self, *, threads_per_block: Int, registers_per_thread: Int) -> SIMD[float64, 1]` Calculates theoretical occupancy for given thread and register config. Occupancy represents the ratio of active warps to the maximum possible warps on a streaming multiprocessor. Note: TODO (KERN-795): Add occupancy calculation based on shared memory usage and thread block size and take use the minimum value. **Args:** * ​threads\_per\_block (`Int`): Number of threads in each block. * ​registers\_per\_thread (`Int`): Number of registers used by each thread. **Returns:** Occupancy as a ratio between 0.0 and 1.0. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes GPU information to a writer. Outputs all GPU specifications and capabilities to the provided writer in a human-readable format. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait. **Args:** * ​writer (`W`): A Writer instance to output the GPU information. ### `__str__` `__str__(self) -> String` Returns a string representation of the GPU information. Converts all GPU specifications and capabilities to a human-readable string format. **Returns:** String containing all GPU information. --- ## init_intel_amx `init_intel_amx() -> Bool` --- ## inline_array Provides a fixed-size array implementation with compile-time size checking. The `InlineArray` type represents a fixed-size sequence of homogeneous elements where the size is determined at compile time. It provides efficient memory layout and bounds checking while maintaining type safety. The `InlineArray` type is part of the `prelude` module and therefore does not need to be imported in order to use it. Examples: ```mojo # Create an array of 3 integers var arr = InlineArray[Int, 3](1, 2, 3) # Access elements print(arr[0]) # Prints 1 # Fill with a value var filled = InlineArray[Int, 5](fill=42) ``` Notes: * For historical reasons, destructors are not run by default on the elements of an `InlineArray`. This can be controlled with the `run_destructors` parameter. In the future, this will default to `True` and the `run_destructors` parameter will be removed. ## Structs * [​`InlineArray`](/mojo/stdlib/collections/inline_array/InlineArray): A fixed-size sequence of homogeneous elements where size is a constant expression. --- ## InlineArray `struct InlineArray[ElementType: Copyable & Movable, size: Int, *, run_destructors: Bool = False]` A fixed-size sequence of homogeneous elements where size is a constant expression. InlineArray provides a fixed-size array implementation with compile-time size checking. The array size is determined at compile time and cannot be changed. Elements must implement the `Copyable` and `Movable` traits. Examples: ```mojo # Create array of 3 integers var arr = InlineArray[Int, 3](1, 2, 3) # Create array filled with value var filled = InlineArray[Int, 5](fill=42) # Access elements print(arr[0]) # Prints 1 ``` ## Parameters * ​ElementType (`Copyable & Movable`): The type of the elements in the array. Must implement `Copyable` and `Movable`. * ​size (`Int`): The size of the array. Must be a positive integer constant. * ​run\_destructors (`Bool`): Whether to run destructors on the elements. Defaults to `False` for backwards compatibility. Will default to `True` in the future. 
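Given the note above about destructors, here is a minimal sketch of opting in via the `run_destructors` parameter (illustrative only; `InlineArray` is in the prelude, so no import is needed):

```mojo
fn main():
    # run_destructors defaults to False for backwards compatibility;
    # passing True makes each element's destructor run when the array
    # itself is destroyed.
    var names = InlineArray[String, 2, run_destructors=True]("hello", "world")
    print(names[0])  # hello
```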
## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `type` `alias type = array, :trait ElementType>` ## Methods ### `__init__` `__init__(out self)` This constructor will always cause a compile time error if used. It is used to steer users away from uninitialized memory. `__init__(out self, *, uninitialized: Bool)` Create an InlineArray with uninitialized memory. Examples: ```mojo var uninitialized_array = InlineArray[Int, 10](uninitialized=True) ``` Notes: This constructor is unsafe and should be used with caution. The array elements will be uninitialized and accessing them before initialization is undefined behavior. **Args:** * ​uninitialized (`Bool`): A boolean to indicate if the array should be initialized. Always set to `True` (it's not actually used inside the constructor). `__init__(out self, *, owned unsafe_assume_initialized: InlineArray[UnsafeMaybeUninitialized[ElementType], size])` Constructs an `InlineArray` from an `InlineArray` of `UnsafeMaybeUninitialized`. Warning: This is an unsafe constructor. Only use it if you are certain all elements are properly initialized. Notes: This constructor assumes all elements in the input array are initialized. Using uninitialized elements results in undefined behavior, even for types that are valid for any bit pattern (e.g. `Int` or `Float`). **Args:** * ​unsafe\_assume\_initialized (`InlineArray[UnsafeMaybeUninitialized[ElementType], size]`): The array of `UnsafeMaybeUninitialized` elements. All elements must be initialized. `@implicit` `__init__[batch_size: Int = 64](out self, fill: ElementType)` Constructs an array where each element is initialized to the supplied value. Examples: ```mojo var filled = InlineArray[Int, 5](fill=42) # [42, 42, 42, 42, 42] # For large arrays, consider adjusting batch_size to balance # compile time and runtime performance: var large = InlineArray[Int, 10000].__init__[batch_size=32](fill=0) ``` Notes: * Full unrolling with large arrays (>2k elements) can cause significant compiler slowdowns. * Using batch\_size=64 balances AVX512 efficiency and instruction cache usage. * For very large arrays, using smaller batch sizes (e.g., 32 or 16) can further improve compilation speed while still maintaining good runtime performance. **Parameters:** * ​batch\_size (`Int`): The number of elements to unroll for filling the array. Default is 64, which optimizes for AVX512 operations on modern CPUs. For large arrays (>2k elements), this batched approach significantly improves compile times compared to full unrolling while maintaining good runtime performance. **Args:** * ​fill (`ElementType`): The element value to fill each index with. `@implicit` `__init__(out self, owned *elems: ElementType, *, __list_literal__: Tuple[] = Tuple())` Constructs an array from a variadic list of elements. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) # [1, 2, 3] ``` **Args:** * ​\*elems (`ElementType`): The elements to initialize the array with. Must match the array size. * ​**list\_literal** (`Tuple[]`): Specifies that this constructor can be used for list literals. `__init__(out self, *, owned storage: VariadicListMem[ElementType, origin, is_owned])` Construct an array from a low-level internal representation. **Args:** * ​storage (`VariadicListMem[ElementType, origin, is_owned]`): The variadic list storage to construct from. Must match array size. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy constructs the array from another array. 
Notes: Creates a deep copy by copying each element individually. **Args:** * ​other (`Self`): The array to copy from. ### `__del__` `__del__(owned self)` Deallocates the array and destroys its elements. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) # arr's destructor is called automatically when it goes out of scope ``` Notes: This destructor is called automatically when the array goes out of scope. If the array's `run_destructors` parameter is `True`, it will call the destructor on each element in the array before deallocating the array's memory. ### `__getitem__` `__getitem__[I: Indexer](ref self, idx: I) -> ref [self] ElementType` Gets a reference to the element at the given index. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr[0]) # Prints 1 - first element print(arr[1]) # Prints 2 - second element print(arr[-1]) # Prints 3 - last element print(arr[-2]) # Prints 2 - second to last element ``` Notes: This method provides array-style indexing access to elements in the InlineArray. It supports both positive indices starting from 0 and negative indices counting backwards from the end of the array. The index is bounds-checked at runtime. **Parameters:** * ​I (`Indexer`): The type parameter representing the index type, must implement Indexer trait. **Args:** * ​idx (`I`): The index to access. Can be positive (0 to len-1) or negative (-len to -1). **Returns:** A reference to the element at the specified index. `__getitem__[I: Indexer, //, idx: I](ref self) -> ref [self] ElementType` Gets a reference to the element at the given index with compile-time bounds checking. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr[0]) # Prints 1 - first element print(arr[-1]) # Prints 3 - last element ``` Notes: This overload provides array-style indexing with compile-time bounds checking. The index must be a compile-time constant value. It supports both positive indices starting from 0 and negative indices counting backwards from the end of the array. **Parameters:** * ​I (`Indexer`): The type parameter representing the index type, must implement Indexer trait. * ​idx (`I`): The compile-time constant index to access. Can be positive (0 to len-1) or negative (-len to -1). **Returns:** A reference to the element at the specified index. ### `__contains__` `__contains__[T: EqualityComparable & Copyable & Movable, //](self: InlineArray[T, size], value: T) -> Bool` Tests if a value is present in the array using the `in` operator. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(3 in arr) # Prints True - value exists print(4 in arr) # Prints False - value not found ``` Notes: This method enables using the `in` operator to check if a value exists in the array. It performs a linear search comparing each element for equality with the given value. The element type must implement the `EqualityComparable`, `Copyable` and `Movable` traits to support equality comparison. **Parameters:** * ​T (`EqualityComparable & Copyable & Movable`): The element type, must implement both `EqualityComparable` and `Copyable` and `Movable`. **Args:** * ​value (`T`): The value to search for. **Returns:** True if the value is found in any position in the array, False otherwise. ### `copy` `copy(self) -> Self` Creates a deep copy of the array. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) var copy = arr.copy() # Creates new array [1, 2, 3] ``` **Returns:** A new array containing copies of all elements. 
### `__len__` `__len__(self) -> Int` Returns the length of the array. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(len(arr)) # Prints 3 ``` Notes: The length is a compile-time constant value determined by the size parameter used when creating the array. **Returns:** The size of the array as an Int. ### `unsafe_get` `unsafe_get[I: Indexer](ref self, idx: I) -> ref [self] ElementType` Gets a reference to an element without bounds checking. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) print(arr.unsafe_get(0)) # Prints 1 ``` Warning: This is an unsafe method. No bounds checking is performed. Using an invalid index will cause undefined behavior. Negative indices are not supported. Notes: This is an unsafe method that skips bounds checking for performance. Users should prefer `__getitem__` instead for safety. **Parameters:** * ​I (`Indexer`): A type parameter representing the index type, must implement Indexer trait. **Args:** * ​idx (`I`): The index of the element to get. Must be non-negative and in bounds. Using an invalid index will cause undefined behavior. **Returns:** A reference to the element at the given index. ### `unsafe_ptr` `unsafe_ptr(ref self) -> UnsafePointer[ElementType, mut=self_is_mut, origin=self_is_origin]` Gets an unsafe pointer to the underlying array storage. Examples: ```mojo var arr = InlineArray[Int, 3](1, 2, 3) var ptr = arr.unsafe_ptr() print(ptr[0]) # Prints 1 ``` Warning: This is an unsafe method. The returned pointer: * Becomes invalid if the array is moved * Must not be used to access memory outside array bounds * Must be refreshed after any operation that could move the array Notes: Returns a raw pointer to the array's memory that can be used for direct memory access. The pointer inherits mutability from the array reference. **Returns:** An `UnsafePointer` to the underlying array storage. The pointer's mutability matches that of the array reference. --- ## Inner_matmul_default `struct Inner_matmul_default` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile. --- ## Inner_matmul_i8mm `struct Inner_matmul_i8mm` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows2, TileN, TileK) tile. 
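The `Inner_matmul_*` structs documented in this stretch (default and i8mm above, plus the neon and vnni variants that follow) all conform to the `InnerMatmulKernel` trait and differ only in the CPU instruction set they target, as reflected by the `InnerKernelID` aliases documented below. The following is a hedged sketch of how a caller might choose among them using the `sys` feature queries from this document; it is illustrative only and does not show the library's actual selection logic.

```mojo
from sys import has_neon, has_neon_int8_matmul, has_vnni

fn pick_kernel_id() -> Int:
    # Values mirror the InnerKernelID aliases documented below.
    if has_vnni():
        return 1  # InnerKernelID.VNNI
    if has_neon_int8_matmul():
        return 3  # InnerKernelID.I8MM
    if has_neon():
        return 2  # InnerKernelID.NEON
    return 0  # InnerKernelID.DEFAULT
```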
--- ## Inner_matmul_neon `struct Inner_matmul_neon` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile. --- ## Inner_matmul_vnni `struct Inner_matmul_vnni[saturated_vnni: Bool]` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `InnerMatmulKernel`, `Movable`, `UnknownDestructibility` ## Methods ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` Utility function on the inner loop. Run the inner kernel on the whole (kernel\_rows, TileN, TileK) tile. --- ## inner_product `inner_product(a: IntTuple[origin], b: IntTuple[origin]) -> Int` Compute the inner product of two `IntTuple`s. For flat tuples, this is the sum of element-wise products. For nested tuples, the function recurses into corresponding nested elements. Note: If the input tuples have different lengths, `abort()` will be called. **Args:** * ​a (`IntTuple[origin]`): First `IntTuple`. * ​b (`IntTuple[origin]`): Second `IntTuple`. **Returns:** The inner product as an `Int`. --- ## InnerKernelID `@register_passable(trivial)` `struct InnerKernelID` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DEFAULT` `alias DEFAULT = InnerKernelID(0)` ### `I8MM` `alias I8MM = InnerKernelID(3)` ### `NEON` `alias NEON = InnerKernelID(2)` ### `VNNI` `alias VNNI = InnerKernelID(1)` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` --- ## InnerMatmulKernel ## Implemented traits `AnyType`, `Copyable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__inner_matmul__` `__inner_matmul__[kernel_rows: Int, kernel_cols: Int, simd_size: Int](self: _Self, c: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], a: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], b_packed: NDBuffer[type, 3, origin, shape], global_offset: GemmShape, global_bound: GemmShape, tile_n_k: IndexList[2], skip_boundary_check: Bool)` --- ## input `input(prompt: String = __init__[__mlir_type.!kgen.string]("")) -> String` Reads a line of input from the user. Reads a line from standard input, converts it to a string, and returns that string. 
If the prompt argument is present, it is written to standard output without a trailing newline. Examples: ```mojo name = input("Enter your name: ") print("Hello", name) ``` If the user enters "Mojo" it prints "Hello Mojo". **Args:** * ​prompt (`String`): An optional string to be printed before reading input. **Returns:** A string containing the line read from the user input. --- ## Install guide import TutorialStack from '@site/src/components/TutorialStack'; import InstallModular from '@site/docs/_includes/install-modular.mdx'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; You can install all the Modular APIs and tools (including MAX and Mojo) as a single package called `modular`, using `pip` or `magic` (or other Python and Conda package managers). The `modular` package is available as a nightly and a stable build. You can also select the latest nightly or stable documentation, using a drop-down in the website header. By default, we show the nightly version so you always see the latest APIs and documentation. If you just want to get started, instead see our [quickstart guide](/max/get-started). :::note If you'll mostly be programming in Mojo, we recommend installing with `magic`, because the `pip` package currently doesn't include the Mojo LSP or debugger. ::: ## Install To get the latest performance improvements and new features, we recommend installing our nightly build, which we release several times a week. If you want a better tested but slightly older version, you can install a stable build. (Each stable release is described in the [changelog](/max/changelog).) The `modular` package installs MAX, Mojo, and other package dependencies. :::note GitHub stable branch When using a stable build, make sure you also checkout the `stable` branch when you clone the [Modular repo](https://github.com/modular/modular) (because the `main` branch includes the latest nightly code). For example: ```sh git clone -b stable https://github.com/modular/modular.git ``` ::: ## Uninstall You can uninstall `modular` from your virtual environment with the following command: ```sh pip uninstall modular ``` To deactivate your virtual environment, run: ```sh deactivate ``` You can uninstall `modular` from your virtual environment with the following command: ```sh uv pip uninstall modular ``` To deactivate your virtual environment, run: ```sh deactivate ``` If you installed with `magic`, just delete the project paths that you created with `magic init` (paths with a `pyproject.toml`, `mojoproject.toml`, or `pixi.toml` file). To remove the `magic` tool, delete the `magic` binary: ```sh rm ~/.modular/bin/magic ``` ## What's included The `modular` Python wheel installs the following: - MAX tools and libraries - [`max` CLI](/max/max-cli) - [`max` Python library](/max/api/python/) - [`max` Mojo library](/max/api/mojo/) - Mojo tools and libraries - [`mojo` CLI](/mojo/cli) - [Mojo standard library](/mojo/lib) `pip` known issues: - The Mojo LSP and Mojo debugger aren't included. If you want to develop with Mojo, we currently recommend you install the `modular` conda package with [Magic](/magic) or [conda](/magic/conda). The `max` conda package installs the following: - MAX tools and libraries - [`max` Python library](/max/api/python/) - [`max` Mojo library](/max/api/mojo/) - [MAX Engine C API](/max/api/c/) - Mojo tools and libraries - [`mojo` CLI](/mojo/cli) - [Mojo standard library](/mojo/lib) - Mojo LSP - Mojo debugger The `max-pipelines` package installs the [`max` CLI](/max/max-cli). 
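Once the install completes, it's worth sanity-checking that the tools are on your `PATH` before moving on. The `mojo --version` invocation is shown working in the conda walkthrough later in this guide; we're assuming the `max` CLI accepts `--version` as well.

```sh
mojo --version
max --version
```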
`magic` known issues: - You might encounter issues if you invoke `magic` within a `conda` virtual environment. It's best if you don't mix the two tools. ## Next steps export const tutorials = [ 'magic', 'deploy-llama-vision', ]; --- ## Install MAX with pip import TutorialStack from '@site/src/components/TutorialStack'; import InstallModularNoMagic from '@site/docs/_includes/install-modular-no-magic.mdx'; You can install everything you need to build and deploy MAX models using pip. However, if you want to develop with Mojo, we recommend using [Magic](/magic) or [conda](/magic/conda). ## Get started using pip Here's how to install the Modular platform APIs and tools with pip, and then deploy a GenAI model on a local endpoint: 1. Start a Python virtual environment and install MAX: 2. Start a local endpoint for Llama 3: ```sh max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF ``` In addition to starting a local server, this downloads the model weights and compiles the model, which might take some time. The endpoint is ready when you see the URI printed in your terminal: ```output Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit) ``` 3. Now open another terminal to send a request using `curl`: ```sh curl -N http://0.0.0.0:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "modularai/Llama-3.1-8B-Instruct-GGUF", "stream": true, "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who won the World Series in 2020?"} ] }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g' ``` Now check out these tutorials for more about how to accelerate your GenAI models with MAX: export const maxTutorials = [ 'run-embeddings-with-max-serve', 'deploy-llama-vision', 'get-started-with-max-graph-in-python', ]; ## What's included The `modular` Python package installs the following: - MAX tools and libraries - [`max` CLI](/max/max-cli) - [`max` Python library](/max/api/python/) - [`max` Mojo library](/max/api/mojo/) - Mojo tools and libraries - [`mojo` CLI](/mojo/cli) - [Mojo standard library](/mojo/lib) ## Known issues - The Mojo LSP and Mojo debugger aren't included. If you want to develop with Mojo, we currently recommend you install the `max` conda package with [Magic](/magic) or [conda](/magic/conda). --- ## Install MAX/Mojo with conda import TutorialStack from '@site/src/components/TutorialStack'; Although we recommend using [Magic](/magic) to manage your virtual environments and packages for MAX and Mojo, you can also add MAX/Mojo to a [conda](https://docs.conda.io/projects/conda/en/latest/index.html) project. :::note The `max` package includes Mojo. There's no separate package for Mojo. ::: ## Get started with MAX Here's how to install MAX using conda and then deploy a GenAI model on a local endpoint: 1. Create a conda project that includes MAX: ```sh conda create -n max-project -c conda-forge -c https://conda.modular.com/max/ \ python=3.11 max=* max-pipelines=* -y ``` 2. Activate the environment: ```sh conda activate max-project ``` 3. Start a local endpoint for Llama 3: ```sh max-pipelines serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF ``` In addition to starting a local server, this downloads the model weights and compiles the model, which might take some time. The endpoint is ready when you see the URI printed in your terminal: ```output Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit) ``` 4. 
Now open another terminal to send a request using `curl`:

```sh
curl -N http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the World Series in 2020?"}
    ]
  }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
```

Now check out these tutorials for more about how to accelerate your GenAI models with MAX:

export const maxTutorials = [ 'run-embeddings-with-max-serve', 'deploy-llama-vision', 'get-started-with-max-graph-in-python', ];

## Get started with Mojo

Here's how to install Mojo using conda and run a code example:

1. Create a conda project that includes MAX/Mojo:

   ```sh
   conda create -n mojo-project -c conda-forge -c https://conda.modular.com/max/ \
     python=3.11 max=* -y
   ```

2. Activate the environment and you'll have access to `mojo`:

   ```sh
   conda activate mojo-project
   ```

   ```sh
   mojo --version
   ```

3. Try one of the Mojo code examples:

   ```sh
   git clone https://github.com/modular/modular.git
   ```

   ```sh
   cd max/examples/mojo
   ```

   ```sh
   mojo hello_interop.mojo
   ```

   ```output
   Hello Mojo 🔥!
   9
   6
   3
   Hello from Python!
   I can even print a numpy array: [1 2 3]
   ```

Now continue exploring Mojo with these tutorials:

export const mojoTutorials = [ 'get-started', 'gpu/intro-tutorial', ];

## What's included

The `max` conda package installs the following:

- MAX tools and libraries
  - [`max` Python library](/max/api/python/)
  - [`max` Mojo library](/max/api/mojo/)
  - [MAX Engine C API](/max/api/c/)
- Mojo tools and libraries
  - [`mojo` CLI](/mojo/cli)
  - [Mojo standard library](/mojo/lib)
  - Mojo LSP
  - Mojo debugger

The `max-pipelines` package installs the [`max` CLI](/max/max-cli).

## Known issues

- You might encounter issues if you invoke `magic` within a `conda` virtual environment. It's best if you don't mix the two tools.

---

## Install Modular

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import MaxInstall from '@site/src/components/MaxInstall';
import CodeBlock from '@theme/CodeBlock';

export default function InstallModular({ folder = "modular" }) { return ( Create a project folder: {`mkdir ${folder} && cd ${folder}`} Create and activate a virtual environment: {`python3 -m venv .venv/${folder} \\ && source .venv/${folder}/bin/activate`} Install the modular Python package: {`pip install modular \\ --extra-index-url https://download.pytorch.org/whl/cpu \\ --extra-index-url https://dl.modular.com/public/nightly/python/simple/`} {`pip install modular \\ --extra-index-url https://modular.gateway.scarf.sh/simple/ \\ --extra-index-url https://download.pytorch.org/whl/cpu`} Install uv: {`curl -LsSf https://astral.sh/uv/install.sh | sh`} Then restart your terminal to make uv accessible. Create a project: {`uv init ${folder} && cd ${folder}`} Create and start a virtual environment: {`uv venv && source .venv/bin/activate`} Install the modular Python package: {`uv pip install modular \\ --extra-index-url https://download.pytorch.org/whl/cpu \\ --extra-index-url https://dl.modular.com/public/nightly/python/simple/ \\ --index-strategy unsafe-best-match`} {`uv pip install modular \\ --extra-index-url https://modular.gateway.scarf.sh/simple/ \\ --extra-index-url https://download.pytorch.org/whl/cpu \\ --index-strategy unsafe-best-match`} Install magic: Then run the source command that's printed in your terminal.
2. Create a project:

   ```sh
   magic init modular --format pyproject && cd modular
   ```

3. Install the `max-pipelines` conda package:

   ```sh
   magic add max-pipelines
   ```

   Or pin a specific release:

   ```sh
   magic add "max-pipelines==25.3"
   ```

4. Start the virtual environment:

   ```sh
   magic shell
   ```

---

## Install Modular No Magic

**pip:**

1. Create a project folder:

   ```sh
   mkdir modular && cd modular
   ```

2. Create and activate a virtual environment:

   ```sh
   python3 -m venv .venv/modular \
   && source .venv/modular/bin/activate
   ```

3. Install the `modular` Python package.

   Nightly build:

   ```sh
   pip install modular \
     --extra-index-url https://download.pytorch.org/whl/cpu \
     --extra-index-url https://dl.modular.com/public/nightly/python/simple/
   ```

   Release build:

   ```sh
   pip install modular \
     --extra-index-url https://download.pytorch.org/whl/cpu
   ```

**uv:**

1. Install uv:

   ```sh
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

   Then restart your terminal to make uv accessible.

2. Create a project:

   ```sh
   uv init modular && cd modular
   ```

3. Create and start a virtual environment:

   ```sh
   uv venv && source .venv/bin/activate
   ```

4. Install the `modular` Python package.

   Nightly build:

   ```sh
   uv pip install modular \
     --extra-index-url https://download.pytorch.org/whl/cpu \
     --extra-index-url https://dl.modular.com/public/nightly/python/simple/ \
     --index-strategy unsafe-best-match
   ```

   Release build:

   ```sh
   uv pip install modular \
     --extra-index-url https://download.pytorch.org/whl/cpu \
     --index-strategy unsafe-best-match
   ```

---

## int

Implements the Int class. These are Mojo built-ins, so you don't need to import them.

## Structs

* [`Int`](/mojo/stdlib/builtin/int/Int): This type represents an integer value.

## Traits

* [`ImplicitlyIntable`](/mojo/stdlib/builtin/int/ImplicitlyIntable): The `ImplicitlyIntable` trait describes a type that can be converted to an Int implicitly.
* [`Indexer`](/mojo/stdlib/builtin/int/Indexer): The `Indexer` trait is used for types that can index into a collection or pointer. The type returned is the underlying \_\_mlir\_type.index, enabling types like `UInt` to not have to be converted to an `Int` first. This type is implicitly convertible to an `Int`, so it can be used anywhere an `Int` can, e.g. for comparisons.
* [`Intable`](/mojo/stdlib/builtin/int/Intable): The `Intable` trait describes a type that can be converted to an Int.
* [`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising): The `IntableRaising` trait describes a type that can be converted to an Int, but the conversion might raise an error.

## Functions

* [`index`](/mojo/stdlib/builtin/int/index-function): Returns the value of `__index__` for the given value.

---

## Int

`@register_passable(trivial)`

`struct Int`

This type represents an integer value.

## Fields

* value (`index`): The underlying storage for the integer value.

## Implemented traits

`Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Comparable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `DevicePassable`, `EqualityComparable`, `ExplicitlyCopyable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `ImplicitlyBoolable`, `Indexer`, `Intable`, `KeyElement`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Powable`, `PythonConvertible`, `Representable`, `Roundable`, `Stringable`, `TypeIdentifiable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher`

## Aliases

### `BITWIDTH`

`alias BITWIDTH = __init__[::Intable](bitwidthof[::DType,__mlir_type.!kgen.target]())`

The bit width of the integer type.
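As a quick, minimal sketch of how this alias reads in practice (the printed value depends on the target platform; 64 is typical on modern machines):

```mojo
fn main():
    # Int.BITWIDTH is the bit width of the platform's native integer type.
    print(Int.BITWIDTH)  # typically 64
```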
### `device_type`

`alias device_type = Int`

Int is remapped to the same type when passed to accelerator devices.

### `MAX`

`alias MAX = __init__[::Intable](SIMD(max_or_inf[::DType]()))`

Returns the maximum integer value.

### `MIN`

`alias MIN = __init__[::Intable](SIMD(min_or_neg_inf[::DType]()))`

Returns the minimum integer value.

### `TYPE_ID`

`alias TYPE_ID = "stdlib.Int"`

## Methods

### `__init__`

`__init__() -> Self`

Default constructor that produces zero.

`@implicit`
`__init__(value: IntLiteral[value]) -> Self`

Construct Int from the given IntLiteral value.

**Args:**

* value (`IntLiteral[value]`): The init value.

`@implicit`
`__init__(value: UInt) -> Self`

Construct Int from the given UInt value.

**Args:**

* value (`UInt`): The init value.

`__init__[T: Intable](value: T) -> Self`

Get the Int representation of the value.

**Parameters:**

* T (`Intable`): The Intable type.

**Args:**

* value (`T`): The object to get the integral representation of.

`__init__[T: IntableRaising](out self, value: T)`

Get the Int representation of the value.

**Parameters:**

* T (`IntableRaising`): The Intable type.

**Args:**

* value (`T`): The object to get the integral representation of.

**Raises:** If the type does not have an integral representation.

`@implicit`
`__init__[I: ImplicitlyIntable](value: I) -> Self`

Construct Int from an implicitly convertible type.

**Parameters:**

* I (`ImplicitlyIntable`): The type that is implicitly convertible to an `Int`.

**Args:**

* value (`I`): The init value.

`__init__(out self, value: StringSlice[origin], base: UInt = UInt(10))`

Parses and returns the given string as an integer in the given base.

If base is set to 0, the string is parsed as an integer literal, with the following considerations:

* '0b' or '0B' prefix indicates binary (base 2)
* '0o' or '0O' prefix indicates octal (base 8)
* '0x' or '0X' prefix indicates hexadecimal (base 16)
* Without a prefix, it's treated as decimal (base 10)

Examples:

```mojo
>>> Int("32")
32
>>> Int("FF", 16)
255
>>> Int("0xFF", 0)
255
>>> Int("0b1010", 0)
10
```

Notes: This follows [Python's integer literals](https://docs.python.org/3/reference/lexical_analysis.html#integers).

**Args:**

* value (`StringSlice[origin]`): A string to be parsed as an integer in the given base.
* base (`UInt`): Base used for conversion, value must be between 2 and 36, or 0.

**Raises:** If the given string cannot be parsed as an integer value or if an incorrect base is provided.

### `__bool__`

`__bool__(self) -> Bool`

Convert this Int to Bool.

**Returns:** False Bool value if the value is equal to 0 and True otherwise.

### `__neg__`

`__neg__(self) -> Self`

Return -self.

**Returns:** The -self value.

### `__pos__`

`__pos__(self) -> Self`

Return +self.

**Returns:** The +self value.

### `__invert__`

`__invert__(self) -> Self`

Return \~self.

**Returns:** The \~self value.

### `__lt__`

`__lt__(self, rhs: Self) -> Bool`

Compare this Int to the RHS using LT comparison.

**Args:**

* rhs (`Self`): The other Int to compare against.

**Returns:** True if this Int is less-than the RHS Int and False otherwise.

### `__le__`

`__le__(self, rhs: Self) -> Bool`

Compare this Int to the RHS using LE comparison.

**Args:**

* rhs (`Self`): The other Int to compare against.

**Returns:** True if this Int is less-or-equal than the RHS Int and False otherwise.

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

Compare this Int to the RHS using EQ comparison.

**Args:**

* rhs (`Self`): The other Int to compare against.
**Returns:** True if this Int is equal to the RHS Int and False otherwise.

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

Compare this Int to the RHS using NE comparison.

**Args:**

* rhs (`Self`): The other Int to compare against.

**Returns:** True if this Int is non-equal to the RHS Int and False otherwise.

### `__gt__`

`__gt__(self, rhs: Self) -> Bool`

Compare this Int to the RHS using GT comparison.

**Args:**

* rhs (`Self`): The other Int to compare against.

**Returns:** True if this Int is greater than the RHS Int and False otherwise.

### `__ge__`

`__ge__(self, rhs: Self) -> Bool`

Compare this Int to the RHS using GE comparison.

**Args:**

* rhs (`Self`): The other Int to compare against.

**Returns:** True if this Int is greater-or-equal than the RHS Int and False otherwise.

### `__add__`

`__add__(self, rhs: Self) -> Self`

Return `self + rhs`.

**Args:**

* rhs (`Self`): The value to add.

**Returns:** `self + rhs` value.

### `__sub__`

`__sub__(self, rhs: Self) -> Self`

Return `self - rhs`.

**Args:**

* rhs (`Self`): The value to subtract.

**Returns:** `self - rhs` value.

### `__mul__`

`__mul__(self, rhs: Self) -> Self`

Return `self * rhs`.

**Args:**

* rhs (`Self`): The value to multiply with.

**Returns:** `self * rhs` value.

### `__truediv__`

`__truediv__(self, rhs: Self) -> SIMD[float64, 1]`

Return the floating point division of `self` and `rhs`.

**Args:**

* rhs (`Self`): The value to divide on.

**Returns:** `Float64(self)/Float64(rhs)` value.

### `__floordiv__`

`__floordiv__(self, rhs: Self) -> Self`

Return the division of `self` and `rhs` rounded down to the nearest integer.

**Args:**

* rhs (`Self`): The value to divide on.

**Returns:** `floor(self/rhs)` value.

### `__mod__`

`__mod__(self, rhs: Self) -> Self`

Return the remainder of self divided by rhs.

**Args:**

* rhs (`Self`): The value to divide on.

**Returns:** The remainder of dividing self by rhs.

### `__pow__`

`__pow__(self, exp: Self) -> Self`

Return the value raised to the power of the given exponent.

Computes the power of an integer using the Russian Peasant Method.

**Args:**

* exp (`Self`): The exponent value.

**Returns:** The value of `self` raised to the power of `exp`.

### `__lshift__`

`__lshift__(self, rhs: Self) -> Self`

Return `self << rhs`.

**Args:**

* rhs (`Self`): The value to shift with.

**Returns:** `self << rhs`.

### `__rshift__`

`__rshift__(self, rhs: Self) -> Self`

Return `self >> rhs`.

**Args:**

* rhs (`Self`): The value to shift with.

**Returns:** `self >> rhs`.

### `__and__`

`__and__(self, rhs: Self) -> Self`

Return `self & rhs`.

**Args:**

* rhs (`Self`): The RHS value.

**Returns:** `self & rhs`.

### `__or__`

`__or__(self, rhs: Self) -> Self`

Return `self | rhs`.

**Args:**

* rhs (`Self`): The RHS value.

**Returns:** `self | rhs`.

### `__xor__`

`__xor__(self, rhs: Self) -> Self`

Return `self ^ rhs`.

**Args:**

* rhs (`Self`): The RHS value.

**Returns:** `self ^ rhs`.

### `__radd__`

`__radd__(self, value: Self) -> Self`

Return `value + self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value + self`.

### `__rsub__`

`__rsub__(self, value: Self) -> Self`

Return `value - self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value - self`.

### `__rmul__`

`__rmul__(self, value: Self) -> Self`

Return `value * self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value * self`.

### `__rfloordiv__`

`__rfloordiv__(self, value: Self) -> Self`

Return `value // self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value // self`.
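As a minimal sketch of the arithmetic methods documented above (standard `Int` semantics; note that `/` produces a `Float64` while `//` stays integral):

```mojo
fn main():
    var a = 7
    var b = 2

    print(a / b)   # __truediv__ returns a Float64: 3.5
    print(a // b)  # __floordiv__ rounds down: 3
    print(a % b)   # __mod__ is the remainder: 1
    print(a ** b)  # __pow__: 49

    # The shift and bitwise methods map to <<, >>, &, |, and ^.
    print(1 << 4)  # __lshift__: 16
    print(6 & 3)   # __and__: 2
```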
### `__rmod__`

`__rmod__(self, value: Self) -> Self`

Return `value % self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value % self`.

### `__rpow__`

`__rpow__(self, value: Self) -> Self`

Return `pow(value,self)`.

**Args:**

* value (`Self`): The other value.

**Returns:** `pow(value,self)`.

### `__rlshift__`

`__rlshift__(self, value: Self) -> Self`

Return `value << self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value << self`.

### `__rrshift__`

`__rrshift__(self, value: Self) -> Self`

Return `value >> self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value >> self`.

### `__rand__`

`__rand__(self, value: Self) -> Self`

Return `value & self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value & self`.

### `__ror__`

`__ror__(self, value: Self) -> Self`

Return `value | self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value | self`.

### `__rxor__`

`__rxor__(self, value: Self) -> Self`

Return `value ^ self`.

**Args:**

* value (`Self`): The other value.

**Returns:** `value ^ self`.

### `__iadd__`

`__iadd__(mut self, rhs: Self)`

Compute `self + rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__isub__`

`__isub__(mut self, rhs: Self)`

Compute `self - rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__imul__`

`__imul__(mut self, rhs: Self)`

Compute `self * rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__itruediv__`

`__itruediv__(mut self, rhs: Self)`

Compute `self / rhs`, convert to int, and save the result in self.

Since `floor(self / rhs)` is equivalent to `self // rhs`, this yields the same as `__ifloordiv__`.

**Args:**

* rhs (`Self`): The RHS value.

### `__ifloordiv__`

`__ifloordiv__(mut self, rhs: Self)`

Compute `self // rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__imod__`

`__imod__(mut self, rhs: Self)`

Compute `self % rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__ipow__`

`__ipow__(mut self, rhs: Self)`

Compute `pow(self, rhs)` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__ilshift__`

`__ilshift__(mut self, rhs: Self)`

Compute `self << rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__irshift__`

`__irshift__(mut self, rhs: Self)`

Compute `self >> rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__iand__`

`__iand__(mut self, rhs: Self)`

Compute `self & rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__ixor__`

`__ixor__(mut self, rhs: Self)`

Compute `self ^ rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__ior__`

`__ior__(mut self, rhs: Self)`

Compute `self | rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `get_type_name`

`static get_type_name() -> String`

Gets this type's name, for use in error messages when handing arguments to kernels.

TODO: This will go away soon, when we get better error messages for kernel calls.

**Returns:** This type's name.

### `get_device_type_name`

`static get_device_type_name() -> String`

Gets device\_type's name, for use in error messages when handing arguments to kernels.

TODO: This will go away soon, when we get better error messages for kernel calls.

**Returns:** This type's name.

### `__divmod__`

`__divmod__(self, rhs: Self) -> Tuple[Int, Int]`

Computes both the quotient and remainder using integer division.

**Args:**

* rhs (`Self`): The value to divide on.
**Returns:** The quotient and remainder as a tuple `(self // rhs, self % rhs)`. ### `__as_bool__` `__as_bool__(self) -> Bool` Convert this Int to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__int__` `__int__(self) -> Self` Gets the integral value (this is an identity function for Int). **Returns:** The value as an integer. ### `__abs__` `__abs__(self) -> Self` Return the absolute value of the Int value. **Returns:** The absolute value. ### `__ceil__` `__ceil__(self) -> Self` Return the ceiling of the Int value, which is itself. **Returns:** The Int value itself. ### `__floor__` `__floor__(self) -> Self` Return the floor of the Int value, which is itself. **Returns:** The Int value itself. ### `__round__` `__round__(self) -> Self` Return the rounded value of the Int value, which is itself. **Returns:** The Int value itself. `__round__(self, ndigits: Self) -> Self` Return the rounded value of the Int value, which is itself. **Args:** * ​ndigits (`Self`): The number of digits to round to. **Returns:** The Int value itself if ndigits >= 0 else the rounded value. ### `__trunc__` `__trunc__(self) -> Self` Return the truncated Int value, which is itself. **Returns:** The Int value itself. ### `__ceildiv__` `__ceildiv__(self, denominator: Self) -> Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `is_power_of_two` `is_power_of_two(self) -> Bool` Check if the integer is a (non-zero) power of two. **Returns:** True if the integer is a power of two, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this integer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `write_padded` `write_padded[W: Writer](self, mut writer: W, width: Self)` Write the int right-aligned to a set padding. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. * ​width (`Self`): The amount to pad to the left. ### `__str__` `__str__(self) -> String` Get the integer as a string. **Returns:** A string representation. ### `__repr__` `__repr__(self) -> String` Get the integer as a string. Returns the same `String` as `__str__`. **Returns:** A string representation. ### `__hash__` `__hash__(self) -> UInt` Hash the int using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this int value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `to_python_object` `to_python_object(self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. --- ## int_literal Implements the IntLiteral class. ## Structs * [​`IntLiteral`](/mojo/stdlib/builtin/int_literal/IntLiteral): This type represents a static integer literal value with infinite precision. This type is a compile-time construct which stores its value as a parameter. It is typically materialized into other types (like `Int`) for use at runtime. 
This compile-time representation allows for arbitrary precision constants that would overflow on Int and other fixed precision integer types. --- ## int_tuple Hierarchical integer tuple data structures for high-performance tensor operations. This module provides a flexible, memory-efficient implementation of nested integer tuples optimized for tensor shape, stride, and index operations in high-performance computing. The core data structures support both flat and hierarchical representations with efficient memory sharing and zero-copy views. Key components: * `IntArray`: Low-level register-passable array with direct memory management * `IntTuple`: Hierarchical nested tuple with efficient memory layout and operations * Utility functions for tensor shape manipulation, coordinate transformations, and layout operations Performance features: * Register-passable data structures for optimal compiler optimizations * Zero-copy views for efficient memory sharing * Specialized memory layout for nested structures * Optimized algorithms for common tensor operations Common operations: * Shape manipulation: `flatten`, `to_nest`, `apply`, `product`, `sum` * Coordinate transformations: `idx2crd`, `crd2idx` * Layout operations: `compact_order`, `prefix_product` * Structural comparisons: `congruent`, `compatible`, `weakly_congruent` Example usage: ```mojo from layout import IntTuple from layout.int_tuple import flatten, compact_order, size # Create nested tuples var shape = IntTuple(2, IntTuple(3, 4), 5) # Represents shape (2, (3, 4), 5) # Flatten a nested tuple var flat = flatten(shape) # Results in (2, 3, 4, 5) # Create compact strides for a given shape and order var order = IntTuple(1, IntTuple(2, 3), 4) var strides = compact_order(shape, order) # Results in (1, (2, 6), 24) # Calculate total size (product of all elements) var total_size = size(shape) # Results in 120 ``` ## Aliases ### `INT_TUPLE_VALIDATION` `alias INT_TUPLE_VALIDATION = False` ### `IntList` `alias IntList = List[Int, True]` A type alias for a List of integers with ownership. This alias defines a List that contains Int values and has ownership of its data. It's used throughout the module for storing and manipulating collections of integers, particularly for operations like permutations and indices. ### `UNKNOWN_VALUE` `alias UNKNOWN_VALUE = -1` Special value indicating an unknown or unspecified dimension. This constant is used throughout the `IntTuple` system to represent dimensions that are not known at compile time or have not been specified. ## Structs * [​`IntArray`](./IntArray): A memory-efficient, register-passable array of integers. * [​`IntTuple`](./IntTuple): A hierarchical, nested tuple of integers with efficient memory management. ## Functions * [​`abs`](./abs): Compute the absolute value of each element in an `IntTuple`. * [​`apply`](./apply): Apply a function to each integer value in an `IntTuple`. * [​`apply_predicate`](./apply_predicate): Apply a predicate function recursively to two `IntTuple`s. * [​`apply_zip`](./apply_zip): Apply a function to pairs of elements from two `IntTuple`s. * [​`compact_order`](./compact_order): Create a compact stride based on shape and order. * [​`compatible`](./compatible): Test if two shapes are compatible for tensor operations. * [​`congruent`](./congruent): Test if two `IntTuple`s have the same hierarchical structure. * [​`crd2idx`](./crd2idx): Map a logical coordinate to a linear index. * [​`depth`](./depth): Calculates the maximum nesting depth of an `IntTuple`. 
* [​`fill_like`](./fill_like): Creates an `IntTuple` with the same structure as the source but filled with a specified value. * [​`flatten`](./flatten): Flatten a nested `IntTuple` into a single-level `IntTuple`. * [​`idx2crd`](./idx2crd): Converts a linear index to a coordinate tuple within a given shape. * [​`idx2crd2`](./idx2crd2): Convert a linear index to coordinates. * [​`inner_product`](./inner_product): Compute the inner product of two `IntTuple`s. * [​`is_flat`](./is_flat): Check if an `IntTuple` is flat. * [​`is_int`](./is_int): Check if an `IntTuple` represents a single integer value. * [​`is_tuple`](./is_tuple): Check if an `IntTuple` represents a nested tuple. * [​`mul`](./mul): Multiply each element in an `IntTuple` by a scalar value. * [​`prefix_product`](./prefix_product): Compute the exclusive prefix product of an `IntTuple`. * [​`product`](./product): Calculate the product of all values in an `IntTuple`. * [​`product_each`](./product_each): Compute the product of elements in each sub-tuple of an `IntTuple`. * [​`propagate_unknown`](./propagate_unknown): Propagates unknown dimensions from the target `IntTuple` to the source `IntTuple`. * [​`reduce`](./reduce): Apply a reduction function to an `IntTuple` with an initial value. * [​`reverse`](./reverse): Reverses the order of elements in an `IntTuple`, recursively. * [​`shallow_apply`](./shallow_apply): Apply a function to each top-level element of an `IntTuple`. * [​`shape_div`](./shape_div): Performs division operation between shape tuples. * [​`signum`](./signum): Calculate the sign of an integer. * [​`size`](./size): Calculate the total size (product of all elements) of an `IntTuple`. * [​`sorted`](./sorted): Sort an IntTuple using the provided comparison function. * [​`sum`](./sum): Calculate the sum of all values in an `IntTuple`. * [​`to_nest`](./to_nest): Nests a flat `IntTuple` according to the structure of a nested `IntTuple`. * [​`to_unknown`](./to_unknown): Create an `IntTuple` with the same structure but filled with `UNKNOWN_VALUE`. * [​`tuple_max`](./tuple_max): Calculate the maximum value in an `IntTuple`. * [​`tuple_min`](./tuple_min): Compute the element-wise minimum of two `IntTuple`s. * [​`weakly_compatible`](./weakly_compatible): Test if shape A is weakly compatible with shape B. * [​`weakly_congruent`](./weakly_congruent): Test if two IntTuples have similar hierarchical structures. * [​`zip`](./zip): Create a zip iterator from an array of `IntTuple` pointers. --- ## Intable The `Intable` trait describes a type that can be converted to an Int. Any type that conforms to `Intable` or [`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising) can construct an `Int`. This trait requires the type to implement the `__int__()` method. For example: ```mojo struct Foo(Intable): var i: Int fn __int__(self) -> Int: return self.i ``` Now you can construct an `Int`: ```mojo foo = Foo(42) assert_equal(Int(foo), 42) ``` **Note:** If the `__int__()` method can raise an error, use the [`IntableRaising`](/mojo/stdlib/builtin/int/intableraising) trait instead. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. 
**Args:**

* existing (`_Self`): The value to move.

### `__int__`

`__int__(self: _Self) -> Int`

Get the integral representation of the value.

**Returns:** The integral representation of the value.

---

## IntableRaising

The `IntableRaising` trait describes a type that can be converted to an Int, but the conversion might raise an error.

Any type that conforms to [`Intable`](/mojo/stdlib/builtin/int/Intable) or `IntableRaising` can construct an `Int`.

This trait requires the type to implement the `__int__()` method, which can raise an error.

For example:

```mojo
struct Foo(IntableRaising):
    var i: Int

    fn __int__(self) raises -> Int:
        return self.i
```

Now you can construct an `Int`:

```mojo
foo = Foo(42)
assert_equal(Int(foo), 42)
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__int__`

`__int__(self: _Self) -> Int`

Get the integral representation of the value.

**Returns:** The integral representation of the type.

**Raises:** If the type does not have an integral representation.

---

## IntArray

`@register_passable`

`struct IntArray`

A memory-efficient, register-passable array of integers.

`IntArray` provides a low-level implementation of a dynamically-sized integer array with direct memory management. It supports both owned and non-owned (view) modes for efficient memory sharing without copying.

This struct serves as the underlying storage mechanism for `IntTuple` and related data structures, optimized for high-performance tensor operations.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(size: Int = 0) -> Self`

Initialize a new owned `IntArray` with the specified size.

**Args:**

* size (`Int`): Number of integers to allocate space for. Defaults to 0.

`__init__(*, non_owned: Self, offset: Int = 0) -> Self`

Create a non-owned view into another `IntArray`.

Creates a view starting at the specified offset in the source array. The resulting array doesn't own the memory and won't free it when destroyed.

**Args:**

* non\_owned (`Self`): The source array to create a view into.
* offset (`Int`): Starting position in the source array. Defaults to 0.

### `__copyinit__`

`__copyinit__(existing: Self) -> Self`

Initialize by copying an existing `IntArray`.

For owned arrays, this performs a deep copy of the data. For non-owned arrays, this creates another view of the same data (zero-copy operation).

**Args:**

* existing (`Self`): The source array to copy from.

### `__del__`

`__del__(owned self)`

Destroy the `IntArray` and free its memory if owned.

Only frees memory for owned arrays (positive \_size) to prevent double-free errors with views.

### `__getitem__`

`__getitem__(self, idx: Int) -> Int`

Access an element at the specified index.

Note: Bounds checking is only performed when `INT_TUPLE_VALIDATION` is enabled.

**Args:**

* idx (`Int`): Zero-based index of the element to access.

**Returns:** The integer value at the specified index.

### `__setitem__`

`__setitem__(mut self, idx: Int, value: Int)`

Set the value at the specified index.

Note: Bounds checking is only performed when `INT_TUPLE_VALIDATION` is enabled.

**Args:**

* idx (`Int`): Zero-based index of the element to modify.
* value (`Int`): The integer value to store at the specified index.

### `owning`

`owning(self) -> Bool`

Check if this `IntArray` owns its memory.

**Returns:** True if this array owns its memory (positive \_size), False if it's a view (negative \_size).

### `size`

`size(self) -> Int`

Get the number of elements in the array.
**Returns:** The number of elements in the array, regardless of ownership status.

### `copy_from`

`copy_from(mut self, offset: Int, source: Self, size: Int)`

Copy elements from another `IntArray`.

**Args:**

* offset (`Int`): Destination offset in this array.
* source (`Self`): Source array to copy from.
* size (`Int`): Number of elements to copy.

`copy_from(mut self, dst_offset: Int, source: Self, src_offset: Int, size: Int)`

Copy elements from another IntArray with source offset.

**Args:**

* dst\_offset (`Int`): Destination offset in this array.
* source (`Self`): Source array to copy from.
* src\_offset (`Int`): Source offset in the source array.
* size (`Int`): Number of elements to copy.

---

## intel_amx_intrinsics

## Aliases

### `void`

`alias void = invalid`

## Structs

* [`__tile`](./__tile): An AMX tile representation.
* [`tileconfig`](./tileconfig):

## Functions

* [`init_intel_amx`](./init_intel_amx):

---

## interfaces

General interface for Attention.

## `AttentionImpl` {#max.nn.attention.interfaces.AttentionImpl}

> *class* max.nn.attention.interfaces.AttentionImpl(n\_heads, kv\_params, wqkv, wo, scale)

A generalized attention interface that will be used upstream by a general Transformer. We would expect a separate subclass, articulating each variation of Attention:

* AttentionWithRope
* AttentionWithAlibi
* VanillaAttentionWithCausalMask
* …

There is a series of shared attributes; however, more may be needed for each individual variant. For example, we may introduce an OptimizedRotaryEmbedding class for the AttentionWithRope class:

```python
@dataclass
class AttentionWithRope(AttentionImpl):
    rope: OptimizedRotaryEmbedding
    ...
```

We expect the `__call__` abstractmethod to remain relatively consistent; however, the `**kwargs` argument is exposed, allowing you to leverage additional arguments for each particular variant. For example, we may introduce a VanillaAttentionWithCausalMask class, which includes an attention mask:

```python
@dataclass
class VanillaAttentionWithCausalMask(AttentionImpl):
    ...

    def __call__(
        self,
        x: TensorValueLike,
        kv_collection: ContinuousBatchingKVCacheCollection,
        valid_lengths: TensorValueLike,
        **kwargs,
    ) -> tuple[TensorValue, ContinuousBatchingKVCacheCollection]:
        ...
        if "attn_mask" not in kwargs:
            raise ValueError("attn_mask not provided to VanillaAttentionWithCausalMask")

        # We can then use the attention mask downstream like so:
        op(attn_mask=kwargs["attn_mask"])
```

**Parameters:**

* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **wqkv** ([`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) )
* **wo** ([`LinearV1`](../linear.md#max.nn.linear.LinearV1) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )

### `kv_params` {#max.nn.attention.interfaces.AttentionImpl.kv_params}

> kv\_params\*: [KVCacheParams](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams)\*

KV Cache Params, including the number of kv heads, the head dim, and data type.

### `n_heads` {#max.nn.attention.interfaces.AttentionImpl.n_heads}

> n\_heads\*: [int](https://docs.python.org/3/library/functions.html#int)\*

The number of attention heads.

### `scale` {#max.nn.attention.interfaces.AttentionImpl.scale}

> scale\*: [float](https://docs.python.org/3/library/functions.html#float)\*

The scale factor for the attention.
### `wo` {#max.nn.attention.interfaces.AttentionImpl.wo}

> wo\*: [LinearV1](../linear.md#max.nn.linear.LinearV1)\*

A linear layer for the output projection.

### `wqkv` {#max.nn.attention.interfaces.AttentionImpl.wqkv}

> wqkv\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)\*

The concatenation of q, k, and v weight vectors.

## `AttentionImplQKV` {#max.nn.attention.interfaces.AttentionImplQKV}

> *class* max.nn.attention.interfaces.AttentionImplQKV(n\_heads, kv\_params, wq, wk, wv, wo, scale)

A generalized attention interface that will be used upstream by a general Transformer. We would expect a separate subclass, articulating each variation of Attention:

* AttentionWithRope
* AttentionWithAlibi
* VanillaAttentionWithCausalMask
* …

There is a series of shared attributes; however, more may be needed for each individual variant. For example, we may introduce an OptimizedRotaryEmbedding class for the AttentionWithRope class:

```python
@dataclass
class AttentionWithRope(AttentionImpl):
    rope: OptimizedRotaryEmbedding
    ...
```

We expect the `__call__` abstractmethod to remain relatively consistent; however, the `**kwargs` argument is exposed, allowing you to leverage additional arguments for each particular variant. For example, we may introduce a VanillaAttentionWithCausalMask class, which includes an attention mask:

```python
@dataclass
class VanillaAttentionWithCausalMask(AttentionImpl):
    ...

    def __call__(
        self,
        x: TensorValueLike,
        kv_collection: ContinuousBatchingKVCacheCollection,
        valid_lengths: TensorValueLike,
        **kwargs,
    ) -> tuple[TensorValue, ContinuousBatchingKVCacheCollection]:
        ...
        if "attn_mask" not in kwargs:
            raise ValueError("attn_mask not provided to VanillaAttentionWithCausalMask")

        # We can then use the attention mask downstream like so:
        op(attn_mask=kwargs["attn_mask"])
```

**Parameters:**

* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **wq** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **wk** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **wv** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|`
[`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **wo** ([`LinearV1`](../linear.md#max.nn.linear.LinearV1) ) * **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) ) ### `kv_params` {#max.nn.attention.interfaces.AttentionImplQKV.kv_params} > kv\_params\*: [KVCacheParams](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams)\* KV Cache Params, including the number of kv heads, the head dim, and data type. ### `n_heads` {#max.nn.attention.interfaces.AttentionImplQKV.n_heads} > n\_heads\*: [int](https://docs.python.org/3/library/functions.html#int)\* The number of attention heads. ### `scale` {#max.nn.attention.interfaces.AttentionImplQKV.scale} > scale\*: [float](https://docs.python.org/3/library/functions.html#float)\* The scale factor for the attention. ### `wk` {#max.nn.attention.interfaces.AttentionImplQKV.wk} > wk\*: Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/type.md#max.graph.type.Shape) | [Dim](../../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* The k weight vector. ### `wo` {#max.nn.attention.interfaces.AttentionImplQKV.wo} > wo\*: [LinearV1](../linear.md#max.nn.linear.LinearV1)\* A linear layer for the output projection. ### `wq` {#max.nn.attention.interfaces.AttentionImplQKV.wq} > wq\*: Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/type.md#max.graph.type.Shape) | [Dim](../../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* The q weight vector. ### `wv` {#max.nn.attention.interfaces.AttentionImplQKV.wv} > wv\*: Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/type.md#max.graph.type.Shape) | [Dim](../../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* The v weight vector. ## `DistributedAttentionImpl` {#max.nn.attention.interfaces.DistributedAttentionImpl} > *class* max.nn.attention.interfaces.DistributedAttentionImpl A generalized Distributed attention interface. 
---

## interpolate_point_1d

`interpolate_point_1d[coordinate_transformation_mode: CoordinateTransformationMode, antialias: Bool, rank: Int, type: DType, interpolation_mode: InterpolationMode](interpolator: Interpolator[interpolation_mode], dim: Int, out_coords: IndexList[rank], scale: SIMD[float32, 1], input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])`

---

## InterpolationMode

`struct InterpolationMode`

## Fields

* value (`Int`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `Linear`

`alias Linear = InterpolationMode(0)`

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

---

## Interpolator

`@register_passable(trivial)`

`struct Interpolator[mode: InterpolationMode]`

## Fields

* cubic\_coeff (`SIMD[float32, 1]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit`
`__init__(cubic_coeff: SIMD[float32, 1]) -> Self`

`__init__() -> Self`

### `filter_length`

`static filter_length() -> Int`

### `filter`

`filter(self, x: SIMD[float32, 1]) -> SIMD[float32, 1]`

---

## interval

A self-balancing interval tree is a specialized binary search tree designed to efficiently store and query intervals. It maintains intervals sorted by their low endpoints and augments each node with a `max_high` attribute, representing the maximum high endpoint in its subtree. This `max_high` value enables efficient overlap searching by pruning the search space. Self-balancing mechanisms, such as Red-Black or AVL trees, ensure logarithmic time complexity for operations.

Key Features:

* Stores intervals (low, high).
* Nodes ordered by `low` endpoints.
* `max_high` attribute at each node for efficient overlap search.
* Self-balancing (e.g., using Red-Black tree logic) for O(log n) operations.

Operations:

* Insertion: O(log n) - Adds a new interval, maintaining balance and updating `max_high`.
* Overlap Search: O(log n) - Finds intervals overlapping a query interval using `max_high` for pruning.
* Deletion: O(log n) - Removes an interval, maintaining balance and updating `max_high`.

Space Complexity: O(n), where n is the number of intervals.

Use Cases:

* Calendar scheduling
* Computational geometry
* Genomics
* Database indexing
* Resource allocation

In essence, this data structure provides a fast and efficient way to manage and query interval data, particularly for finding overlaps.

## Structs

* [`Interval`](/mojo/stdlib/collections/interval/Interval): A half-open interval \[start, end) that represents a range of values.
* [`IntervalTree`](/mojo/stdlib/collections/interval/IntervalTree): An interval tree data structure for efficient range queries.

## Traits

* [`IntervalElement`](/mojo/stdlib/collections/interval/IntervalElement): The trait denotes a composition of the `Copyable`, `Movable`, `Writable`, `Intable`, and `Comparable` traits, which is also subtractable.

---

## Interval

`struct Interval[T: IntervalElement]`

A half-open interval \[start, end) that represents a range of values. The interval includes the start value but excludes the end value.

## Parameters

* T (`IntervalElement`): The type of the interval bounds.

## Fields

* start (`T`): The inclusive start of the interval.
* end (`T`): The exclusive end of the interval.
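As a minimal usage sketch (assuming `Interval` is importable from `collections.interval`, per the links above; the methods used here are documented below):

```mojo
from collections.interval import Interval

fn main():
    # Half-open interval [3, 7): includes 3, excludes 7.
    var a = Interval(3, 7)
    var b = Interval(5, 9)

    print(3 in a)         # True: the start is inclusive
    print(7 in a)         # False: the end is exclusive
    print(a.overlaps(b))  # True: [3, 7) and [5, 9) share [5, 7)
    print(len(a))         # 4
```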
## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__(out self, start: T, end: T)`

Initialize an interval with start and end values.

**Args:**

* start (`T`): The starting value of the interval.
* end (`T`): The ending value of the interval. Must be greater than or equal to start.

`__init__(out self, interval: Tuple[T, T], /)`

Initialize an interval with a tuple of start and end values.

**Args:**

* interval (`Tuple[T, T]`): A tuple containing the start and end values.

### `__copyinit__`

`__copyinit__(out self, existing: Self, /)`

Create a new instance of the interval by copying the values from an existing one.

**Args:**

* existing (`Self`): The interval to copy values from.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self, /)`

Create a new instance of the interval by moving the values from an existing one.

**Args:**

* existing (`Self`): The interval to move values from.

### `__bool__`

`__bool__(self) -> Bool`

Returns whether this interval is non-empty.

**Returns:** True if the interval is not empty (start < end), False otherwise.

### `__lt__`

`__lt__(self, other: Self) -> Bool`

Returns whether this interval is less than another interval.

**Args:**

* other (`Self`): The interval to compare with.

**Returns:** True if this interval's start is less than the other interval's start.

### `__le__`

`__le__(self, other: Self) -> Bool`

Returns whether this interval is less than or equal to another interval.

**Args:**

* other (`Self`): The interval to compare with.

**Returns:** True if this interval's start is less than or equal to the other interval's start.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Returns whether this interval equals another interval.

**Args:**

* other (`Self`): The interval to compare with.

**Returns:** True if both intervals have the same start and end values.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Returns whether this interval is not equal to another interval.

**Args:**

* other (`Self`): The interval to compare with.

**Returns:** True if the intervals are not equal, False if they are equal.

### `__gt__`

`__gt__(self, other: Self) -> Bool`

Returns whether this interval is greater than another interval.

**Args:**

* other (`Self`): The interval to compare with.

**Returns:** True if this interval's end is greater than the other interval's end.

### `__ge__`

`__ge__(self, other: Self) -> Bool`

Returns whether this interval is greater than or equal to another interval.

**Args:**

* other (`Self`): The interval to compare with.

**Returns:** True if this interval's end is greater than or equal to the other interval's end.

### `__contains__`

`__contains__(self, other: T) -> Bool`

Returns whether a value is contained within this interval.

**Args:**

* other (`T`): The value to check.

**Returns:** True if the value is within the interval bounds, False otherwise.

`__contains__(self, other: Self) -> Bool`

Returns whether another interval is fully contained within this interval.

**Args:**

* other (`Self`): The interval to check.

**Returns:** True if the other interval is fully contained within this interval, False otherwise.

### `overlaps`

`overlaps(self, other: Self) -> Bool`

Returns whether this interval overlaps with another interval.

**Args:**

* other (`Self`): The interval to check for overlap with.

**Returns:** True if the intervals overlap, False otherwise.

### `union`

`union(self, other: Self) -> Self`

Returns the union of this interval and another interval.
**Args:**

* other (`Self`): The interval to union with.

**Returns:** The union of this interval and the other interval.

### `intersection`

`intersection(self, other: Self) -> Self`

Returns the intersection of this interval and another interval.

**Args:**

* other (`Self`): The interval to intersect with.

**Returns:** The intersection of this interval and the other interval.

### `__len__`

`__len__(self) -> Int`

Returns the length of this interval.

**Returns:** The difference between end and start values as an integer.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes this interval to a writer in the format '(start, end)'.

**Parameters:**

* W (`Writer`): The writer type that implements the Writer trait.

**Args:**

* writer (`W`): The writer to write the interval to.

### `__str__`

`__str__(self) -> String`

Returns a string representation of this interval.

**Returns:** A string in the format '(start, end)' representing this interval.

### `__repr__`

`__repr__(self) -> String`

Returns a string representation of this interval suitable for debugging.

**Returns:** A string in the format '(start, end)' representing this interval.

---

## IntervalElement

The trait denotes a composition of the `Copyable`, `Movable`, `Writable`, `Intable`, and `Comparable` traits, which is also subtractable.

## Implemented traits

`AnyType`, `Comparable`, `Copyable`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `UnknownDestructibility`, `Writable`

## Methods

### `__copyinit__`

`__copyinit__(out self: _Self, existing: _Self, /)`

Create a new instance of the value by copying an existing one.

**Args:**

* existing (`_Self`): The value to copy.

### `__moveinit__`

`__moveinit__(out self: _Self, owned existing: _Self, /)`

Create a new instance of the value by moving the value of another.

**Args:**

* existing (`_Self`): The value to move.

### `__lt__`

`__lt__(self: _Self, rhs: _Self) -> Bool`

Define whether `self` is less than `rhs`.

**Args:**

* rhs (`_Self`): The right hand side of the comparison.

**Returns:** True if `self` is less than `rhs`.

### `__le__`

`__le__(self: _Self, rhs: _Self) -> Bool`

Define whether `self` is less than or equal to `rhs`.

**Args:**

* rhs (`_Self`): The right hand side of the comparison.

**Returns:** True if `self` is less than or equal to `rhs`.

### `__eq__`

`__eq__(self: _Self, other: _Self) -> Bool`

Define whether two instances of the object are equal to each other.

**Args:**

* other (`_Self`): Another instance of the same type.

**Returns:** True if the instances are equal according to the type's definition of equality, False otherwise.

### `__ne__`

`__ne__(self: _Self, other: _Self) -> Bool`

Define whether two instances of the object are not equal to each other.

**Args:**

* other (`_Self`): Another instance of the same type.

**Returns:** True if the instances are not equal according to the type's definition of equality, False otherwise.

### `__gt__`

`__gt__(self: _Self, rhs: _Self) -> Bool`

Define whether `self` is greater than `rhs`.

**Args:**

* rhs (`_Self`): The right hand side of the comparison.

**Returns:** True if `self` is greater than `rhs`.

### `__ge__`

`__ge__(self: _Self, rhs: _Self) -> Bool`

Define whether `self` is greater than or equal to `rhs`.

**Args:**

* rhs (`_Self`): The right hand side of the comparison.

**Returns:** True if `self` is greater than or equal to `rhs`.
### `__sub__` `__sub__(self: _Self, rhs: _Self) -> _Self` Subtracts rhs from self, must be implemented in concrete types. **Args:** * ​rhs (`_Self`): The value to subtract from self. **Returns:** The result of subtracting rhs from self. ### `__int__` `__int__(self: _Self) -> Int` Get the integral representation of the value. **Returns:** The integral representation of the value. ### `write_to` `write_to[W: Writer](self: _Self, mut writer: W)` Formats the string representation of this type to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The type conforming to `Writable`. --- ## IntervalTree `struct IntervalTree[T: IntervalElement, U: Copyable & Movable & Stringable & Comparable]` An interval tree data structure for efficient range queries. ## Parameters * ​T (`IntervalElement`): The type of the interval bounds, must support subtraction, integer conversion, string conversion, comparison and collection operations. * ​U (`Copyable & Movable & Stringable & Comparable`): The type of the associated data, must support string conversion and collection operations. ## Implemented traits `AnyType`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self)` Initializes an empty IntervalTree. ### `insert` `insert(mut self, interval: Tuple[T, T], data: U)` Insert a new interval into the tree using a tuple representation. **Args:** * ​interval (`Tuple[T, T]`): A tuple containing the start and end values of the interval. * ​data (`U`): The data value to associate with this interval. `insert(mut self, interval: Interval[T], data: U)` Insert a new interval into the tree. This method inserts a new interval and its associated data into the interval tree. It maintains the binary search tree property based on interval start times and updates the tree structure to preserve red-black tree properties. **Args:** * ​interval (`Interval[T]`): The interval to insert into the tree. * ​data (`U`): The data value to associate with this interval. ### `__str__` `__str__(self) -> String` Returns a string representation of the interval tree. **Returns:** A string representation of the interval tree. ### `__repr__` `__repr__(self) -> String` Returns a string representation of the interval tree suitable for debugging. **Returns:** A string representation of the interval tree. ### `write_to` `write_to[w: Writer](self, mut writer: w)` Writes the interval tree to a writer. **Parameters:** * ​w (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`w`): The writer to write the interval tree to. ### `depth` `depth(self) -> Int` Returns the depth of the interval tree. **Returns:** The depth of the interval tree. ### `transplant` `transplant(mut self, mut u: UnsafePointer[_IntervalNode[T, U]], mut v: UnsafePointer[_IntervalNode[T, U]])` Transplants the subtree rooted at node u with the subtree rooted at node v. **Args:** * ​u (`UnsafePointer[_IntervalNode[T, U]]`): The node to transplant. * ​v (`UnsafePointer[_IntervalNode[T, U]]`): The node to transplant to. ### `search` `search(self, interval: Tuple[T, T]) -> List[U]` Searches for intervals overlapping with the given tuple. **Args:** * ​interval (`Tuple[T, T]`): The interval tuple (start, end). **Returns:** A list of data associated with overlapping intervals. `search(self, interval: Interval[T]) -> List[U]` Searches for intervals overlapping with the given interval. **Args:** * ​interval (`Interval[T]`): The interval to search. 
**Returns:** A list of data associated with overlapping intervals. --- ## IntLiteral `@register_passable(trivial)` `struct IntLiteral[value: !pop.int_literal]` This type represents a static integer literal value with infinite precision. This type is a compile-time construct which stores its value as a parameter. It is typically materialized into other types (like `Int`) for use at runtime. This compile-time representation allows for arbitrary precision constants that would overflow on Int and other fixed precision integer types. ## Parameters * ​value (`!pop.int_literal`): The underlying integer value. ## Implemented traits `AnyType`, `Boolable`, `Ceilable`, `Copyable`, `Floorable`, `ImplicitlyBoolable`, `ImplicitlyIntable`, `Indexer`, `Intable`, `Movable`, `Stringable`, `Truncable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Constructor for any value. ### `__bool__` `__bool__(self) -> Bool` Convert this IntLiteral to Bool. **Returns:** False Bool value if the value is equal to 0 and True otherwise. ### `__neg__` `__neg__(self) -> IntLiteral[(0 - value)]` Return -self. **Returns:** The -self value. ### `__pos__` `__pos__(self) -> Self` Return +self. **Returns:** The +self value. ### `__invert__` `__invert__(self) -> IntLiteral[(value ^ -1)]` Return \~self. **Returns:** The \~self value. ### `__lt__` `__lt__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using LT comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is less-than the RHS IntLiteral and False otherwise. ### `__le__` `__le__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using LE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is less-or-equal than the RHS IntLiteral and False otherwise. ### `__eq__` `__eq__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using EQ comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is equal to the RHS IntLiteral and False otherwise. ### `__ne__` `__ne__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using NE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is non-equal to the RHS IntLiteral and False otherwise. ### `__gt__` `__gt__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using GT comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is greater-than the RHS IntLiteral and False otherwise. ### `__ge__` `__ge__(self, rhs: IntLiteral[value]) -> Bool` Compare this IntLiteral to the RHS using GE comparison. **Args:** * ​rhs (`IntLiteral[value]`): The other IntLiteral to compare against. **Returns:** True if this IntLiteral is greater-or-equal than the RHS IntLiteral and False otherwise. ### `__add__` `__add__(self, rhs: IntLiteral[value]) -> IntLiteral[(value + value)]` Return `self + rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to add. **Returns:** `self + rhs` value. ### `__sub__` `__sub__(self, rhs: IntLiteral[value]) -> IntLiteral[(value - value)]` Return `self - rhs`. **Args:** * ​rhs (`IntLiteral[value]`): The value to subtract. **Returns:** `self - rhs` value. 
### `__mul__`

`__mul__(self, rhs: IntLiteral[value]) -> IntLiteral[(value * value)]`

Return `self * rhs`.

**Args:**

* rhs (`IntLiteral[value]`): The value to multiply with.

**Returns:** `self * rhs` value.

### `__floordiv__`

`__floordiv__(self, rhs: IntLiteral[value]) -> IntLiteral[(value // value)]`

Return `self // rhs`.

**Args:**

* rhs (`IntLiteral[value]`): The value to divide with.

**Returns:** `self // rhs` value.

### `__mod__`

`__mod__(self, rhs: IntLiteral[value]) -> IntLiteral[(value % value)]`

Return the remainder of self divided by rhs.

**Args:**

* rhs (`IntLiteral[value]`): The value to divide on.

**Returns:** The remainder of dividing self by rhs.

### `__lshift__`

`__lshift__(self, rhs: IntLiteral[value]) -> IntLiteral[(value << value)]`

Return `self << rhs`.

**Args:**

* rhs (`IntLiteral[value]`): The value to shift with.

**Returns:** `self << rhs`.

### `__rshift__`

`__rshift__(self, rhs: IntLiteral[value]) -> IntLiteral[(value >> value)]`

Return `self >> rhs`.

**Args:**

* rhs (`IntLiteral[value]`): The value to shift with.

**Returns:** `self >> rhs`.

### `__and__`

`__and__(self, rhs: IntLiteral[value]) -> IntLiteral[(value & value)]`

Return `self & rhs`.

**Args:**

* rhs (`IntLiteral[value]`): The RHS value.

**Returns:** `self & rhs`.

### `__or__`

`__or__(self, rhs: IntLiteral[value]) -> IntLiteral[(value | value)]`

Return `self | rhs`.

**Args:**

* rhs (`IntLiteral[value]`): The RHS value.

**Returns:** `self | rhs`.

### `__xor__`

`__xor__(self, rhs: IntLiteral[value]) -> IntLiteral[(value ^ value)]`

Return `self ^ rhs`.

**Args:**

* rhs (`IntLiteral[value]`): The RHS value.

**Returns:** `self ^ rhs`.

### `__as_bool__`

`__as_bool__(self) -> Bool`

Convert this IntLiteral to Bool.

**Returns:** False Bool value if the value is equal to 0 and True otherwise.

### `__int__`

`__int__(self) -> Int`

Convert from IntLiteral to Int.

**Returns:** The value as an integer of platform-specific width.

### `__as_int__`

`__as_int__(self) -> Int`

Implicitly convert to an Int.

**Returns:** An integral value that represents this object.

### `__uint__`

`__uint__(self) -> UInt`

Convert from IntLiteral to UInt.

**Returns:** The value as an unsigned integer of platform-specific width.

### `__ceil__`

`__ceil__(self) -> Self`

Return the ceiling of the IntLiteral value, which is itself.

**Returns:** The IntLiteral value itself.

### `__floor__`

`__floor__(self) -> Self`

Return the floor of the IntLiteral value, which is itself.

**Returns:** The IntLiteral value itself.

### `__trunc__`

`__trunc__(self) -> Self`

Return the truncated IntLiteral value, which is itself.

**Returns:** The IntLiteral value itself.

### `__str__`

`__str__(self) -> String`

Convert from IntLiteral to String.

**Returns:** The value as a string.

### `__ceildiv__`

`__ceildiv__(self, denominator: IntLiteral[value]) -> IntLiteral[(0 - (value // (0 - value)))]`

Return the rounded-up result of dividing self by denominator.

**Args:**

* denominator (`IntLiteral[value]`): The denominator.

**Returns:** The ceiling of dividing numerator by denominator.

### `__index__`

`__index__(self) -> index`

Convert from IntLiteral to index.

**Returns:** The corresponding \_\_mlir\_type.index value, interpreting as signed.

---

## intrinsics

Provides low-level GPU intrinsic operations and memory access primitives.

Implements hardware-specific intrinsics that map directly to GPU assembly instructions, focusing on NVIDIA GPU architectures.
Includes:

* Global memory load/store operations with cache control
* Warp-level primitives and synchronization
* Memory fence and barrier operations
* Atomic operations and memory ordering primitives

These low-level primitives should be used carefully as they correspond directly to hardware instructions and require understanding of the underlying GPU architecture.

## Structs

* [`Scope`](/mojo/stdlib/gpu/intrinsics/Scope): Represents memory synchronization scope levels for GPU memory operations.

## Functions

* [`buffer_load`](/mojo/stdlib/gpu/intrinsics/buffer_load): Loads data from global memory into a SIMD register.
* [`buffer_load_store_lds`](/mojo/stdlib/gpu/intrinsics/buffer_load_store_lds): Loads four bytes from global memory and writes them to shared memory.
* [`buffer_store`](/mojo/stdlib/gpu/intrinsics/buffer_store): Stores a register variable to global memory.
* [`byte_permute`](/mojo/stdlib/gpu/intrinsics/byte_permute): Permutes bytes from two 32-bit integers based on a control mask.
* [`ldg`](/mojo/stdlib/gpu/intrinsics/ldg): Load data from global memory through the non-coherent cache.
* [`load_acquire`](/mojo/stdlib/gpu/intrinsics/load_acquire): Performs an atomic load operation with acquire memory ordering semantics.
* [`load_volatile`](/mojo/stdlib/gpu/intrinsics/load_volatile): Performs a volatile load operation that cannot be optimized away.
* [`lop`](/mojo/stdlib/gpu/intrinsics/lop): Performs an arbitrary logical operation on 3 inputs using a lookup table.
* [`make_buffer_resource`](/mojo/stdlib/gpu/intrinsics/make_buffer_resource): Creates a 128-bit buffer resource descriptor for AMD GPU buffer operations.
* [`mulhi`](/mojo/stdlib/gpu/intrinsics/mulhi): Calculates the most significant 32 bits of the product of two 16-bit unsigned integers.
* [`mulwide`](/mojo/stdlib/gpu/intrinsics/mulwide): Performs a wide multiplication of two 32-bit unsigned integers.
* [`store_release`](/mojo/stdlib/gpu/intrinsics/store_release): Performs an atomic store with release memory ordering semantics.
* [`store_volatile`](/mojo/stdlib/gpu/intrinsics/store_volatile): Performs a volatile store operation that cannot be optimized away.
* [`threadfence`](/mojo/stdlib/gpu/intrinsics/threadfence): Enforces ordering of memory operations across threads.
* [`warpgroup_reg_alloc`](/mojo/stdlib/gpu/intrinsics/warpgroup_reg_alloc): Allocates additional registers for the executing warp group.
* [`warpgroup_reg_dealloc`](/mojo/stdlib/gpu/intrinsics/warpgroup_reg_dealloc): Deallocates additional registers for the executing warp group.

---

## intrinsics

Defines intrinsics. You can import these APIs from the `sys` package. For example:

```mojo
from sys import PrefetchLocality
```

## Aliases

### `block_dim`

`alias block_dim = _BlockDim()`

### `block_id_in_cluster`

`alias block_id_in_cluster = _Cluster_BlockIdx()`

### `block_idx`

`alias block_idx = _BlockIdx()`

### `cluster_dim`

`alias cluster_dim = _ClusterDim()`

### `cluster_idx`

`alias cluster_idx = _ClusterIdx()`

### `global_idx`

`alias global_idx = _GridIdx()`

### `grid_dim`

`alias grid_dim = _GridDim()`

### `thread_idx`

`alias thread_idx = _ThreadIdx()`

## Structs

* [`PrefetchCache`](/mojo/stdlib/sys/intrinsics/PrefetchCache): Prefetch cache type.
* [`PrefetchLocality`](/mojo/stdlib/sys/intrinsics/PrefetchLocality): The prefetch locality.
* [`PrefetchOptions`](/mojo/stdlib/sys/intrinsics/PrefetchOptions): Collection of configuration parameters for a prefetch intrinsic call.
* [`PrefetchRW`](/mojo/stdlib/sys/intrinsics/PrefetchRW): Prefetch read or write.

## Functions

* [`assume`](/mojo/stdlib/sys/intrinsics/assume): Signals to the optimizer that the condition is always true. This allows the optimizer to optimize the code.
* [`ballot`](/mojo/stdlib/sys/intrinsics/ballot): Returns a bitfield (Int32 or Int64) containing the result of its Bool argument in all active lanes, and zero in all inactive lanes. For example, ballot(True) returns the EXEC mask.
* [`compressed_store`](/mojo/stdlib/sys/intrinsics/compressed_store): Compresses the lanes of `value`, skipping `mask` lanes, and stores at `addr`.
* [`expect`](/mojo/stdlib/sys/intrinsics/expect): Provides information about the expected (most probable) value of `val`, which can be used by optimizers.
* [`gather`](/mojo/stdlib/sys/intrinsics/gather): Reads scalar values from a SIMD vector, and gathers them into one vector.
* [`implicitarg_ptr`](/mojo/stdlib/sys/intrinsics/implicitarg_ptr): Get a pointer to AMD's implicit arguments table.
* [`lane_id`](/mojo/stdlib/sys/intrinsics/lane_id): Returns the lane ID of the current thread.
* [`likely`](/mojo/stdlib/sys/intrinsics/likely): Provides information that the most probable value of `val` is going to be `True`. This information can be used by optimizers.
* [`llvm_intrinsic`](/mojo/stdlib/sys/intrinsics/llvm_intrinsic): Calls an LLVM intrinsic with the name `intrin` and return type `type`.
* [`masked_load`](/mojo/stdlib/sys/intrinsics/masked_load): Loads data from memory and returns it, replacing masked lanes with values from the passthrough vector.
* [`masked_store`](/mojo/stdlib/sys/intrinsics/masked_store): Stores a value at a memory location, skipping masked lanes.
* [`prefetch`](/mojo/stdlib/sys/intrinsics/prefetch): Prefetches an instruction or data into cache before it is used.
* [`readfirstlane`](/mojo/stdlib/sys/intrinsics/readfirstlane): Get the value in the lowest active lane of the input operand.
* [`scatter`](/mojo/stdlib/sys/intrinsics/scatter): Takes scalar values from a SIMD vector and `scatters` them into a vector of pointers.
* [`sendmsg`](/mojo/stdlib/sys/intrinsics/sendmsg): Send a message to fixed-function hardware. Refer to the specific ISA manual for the ops and messages.
* [`strided_load`](/mojo/stdlib/sys/intrinsics/strided_load): Loads values from `addr` according to a specific stride.
* [`strided_store`](/mojo/stdlib/sys/intrinsics/strided_store): Stores values to `addr` according to a specific stride.
* [`unlikely`](/mojo/stdlib/sys/intrinsics/unlikely): Provides information that the most probable value of `val` is going to be `False`. This information can be used by optimizers.

---

## Intro to custom ops

Custom operations (custom ops) extend [MAX Graph's Python](/max/model-formats#max-graph) inference APIs with custom [Mojo](/mojo/manual) kernels. Whether you need to optimize the performance of functions, implement custom algorithms, or create hardware-specific versions of existing operators, custom ops provide the flexibility you need.

The [custom ops](/max/api/python/graph/ops#custom) API provides complete control over MAX Graph while handling kernel integration and optimization pipelines automatically. Try it now with our [custom ops examples](https://github.com/modular/modular/tree/main/examples/custom_ops) on GitHub or follow the [Build custom ops for GPUs](/max/tutorials/build-custom-ops) tutorial and [let us know what you think](https://www.modular.com/community).
## How it works

A custom op consists of two main components that work together to integrate your custom implementation into the MAX execution pipeline:

1. A custom function implementation written in Mojo that defines your computation
2. A registration process that connects your function to the graph execution system

Under the hood, custom ops utilize high-level abstractions that handle memory management, device placement, and optimization. The graph compiler integrates your custom op implementation into the execution flow.

For more information:

- Follow the [Build custom ops for GPUs tutorial](/max/tutorials/build-custom-ops)
- Learn more about [GPU programming with Mojo](/mojo/manual/gpu/basics)
- Explore the [Custom ops GitHub examples](https://github.com/modular/modular/tree/main/examples/custom_ops)
- Reference the [MAX Graph custom ops API](/max/api/python/graph/ops#custom)

---

## Intro to pointers

A pointer is an indirect reference to one or more values stored in memory. The pointer is a value that holds a memory address, and it provides APIs to store and retrieve values in that memory. The value pointed to by a pointer is also known as a _pointee_.

The Mojo standard library includes several types of pointers, which provide different sets of features. All of these pointer types are _generic_—they can point to any type of value, and the value type is specified as a parameter. For example, the following code creates an `OwnedPointer` that points to an `Int` value:

```mojo
var ptr: OwnedPointer[Int]
ptr = OwnedPointer(100)
```

The `ptr` variable has a value of type `OwnedPointer[Int]`. The pointer *points to* a value of type `Int`, as shown in Figure 1.

![](../images/owned-pointer-diagram.png#light) ![](../images/owned-pointer-diagram-dark.png#dark)

Figure 1. Pointer and pointee

Accessing the memory—to retrieve or update a value—is called _dereferencing_ the pointer. You can dereference a pointer by following the variable name with an empty pair of square brackets:

```mojo
# Update an initialized value
ptr[] += 10
# Access an initialized value
print(ptr[])
```

## Pointer terminology

Before we jump into the pointer types, here are a few terms you'll run across. Some of them may already be familiar to you.

- **Safe pointers**: are designed to prevent memory errors. Unless you use one of the APIs that are specially designated as unsafe, you can use these pointers without worrying about memory issues like double-free or use-after-free.
- **Nullable pointers**: can point to an invalid memory location (typically 0, or a “null pointer”). Safe pointers aren't nullable.
- **Smart pointers**: own their pointees, which means that the value they point to may be deallocated when the pointer itself is destroyed. Non-owning pointers may point to values owned elsewhere, or may require some manual management of the value lifecycle.
- **Memory allocation**: some pointer types can allocate memory to store their pointees, while other pointers can only point to pre-existing values. Memory allocation can either be implicit (that is, performed automatically when initializing a pointer with a value) or explicit.
- **Uninitialized memory**: refers to memory locations that haven't been initialized with a value, which may therefore contain random data. Newly-allocated memory is uninitialized. The safe pointer types don't allow users to access memory that's uninitialized. Unsafe pointers can allocate a block of uninitialized memory locations and then initialize them one at a time.
Being able to access uninitialized memory is unsafe by definition.

- **Copyable types**: can be copied implicitly (for example, by assigning a value to a variable). Also called *implicitly copyable types*.

```mojo
copied_ptr = ptr
```

*Explicitly copyable* types require the user to request a copy, using a constructor with a keyword argument:

```mojo
copied_owned_ptr = OwnedPointer(other=owned_ptr)
```

## Pointer types

The Mojo standard library includes several pointer types with different characteristics:

- [`Pointer`](/mojo/stdlib/memory/pointer/Pointer) is a safe pointer that points to a single value that it doesn't own.
- [`OwnedPointer`](/mojo/stdlib/memory/owned_pointer/OwnedPointer) is a smart pointer that points to a single value, and maintains exclusive ownership of that value.
- [`ArcPointer`](/mojo/stdlib/memory/arc/ArcPointer) is a reference-counted smart pointer that points to an owned value with ownership potentially shared with other instances of `ArcPointer`.
- [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) points to one or more consecutive memory locations, and can refer to uninitialized memory.

Table 1 summarizes the different types of pointers:

| | `Pointer` | `OwnedPointer` | `ArcPointer` | `UnsafePointer` |
| --- | --- | --- | --- | --- |
| Safe | Yes | Yes | Yes | No |
| Allocates memory | No | Implicitly 1 | Implicitly 1 | Explicitly |
| Owns pointee(s) | No | Yes | Yes | No 2 |
| Copyable | Yes | No 3 | Yes | Yes |
| Nullable | No | No | No | Yes |
| Can point to uninitialized memory | No | No | No | Yes |
| Can point to multiple values (array-like access) | No | No | No | Yes |

Table 1. Pointer types

1 `OwnedPointer` and `ArcPointer` implicitly allocate memory when you initialize the pointer with a value.

2 `UnsafePointer` provides unsafe methods for initializing and destroying instances of the stored type. The user is responsible for managing the lifecycle of stored values.

3 `OwnedPointer` is explicitly copyable, but explicitly copying an `OwnedPointer` copies the *stored value* into a new `OwnedPointer`.

The following sections provide more details on each pointer type.

## `Pointer`

The [`Pointer`](/mojo/stdlib/memory/pointer/Pointer) type is a safe pointer that points to an initialized value that it doesn't own. Some example use cases for a `Pointer` include:

- Storing a reference to a related type. For example, a list's iterator object might hold a `Pointer` back to the original list.
- Passing the memory location for a single value to external code via `external_call()`.
- Where you need an API to return a long-lived reference to a value. (Currently the iterators for standard library collection types like `List` return pointers.)

You can construct a `Pointer` to an existing value by calling the constructor with the `to` keyword argument:

```mojo
ptr = Pointer(to=some_value)
```

You can also create a `Pointer` by copying an existing `Pointer`. A `Pointer` carries an [`origin`](/mojo/manual/values/lifetimes) for the stored value, so Mojo can track the lifetime of the referenced value.

## `OwnedPointer`

The [`OwnedPointer`](/mojo/stdlib/memory/owned_pointer/OwnedPointer) type is a smart pointer designed for cases where there is single ownership of the underlying data. An `OwnedPointer` points to a single item, which is passed in when you initialize the `OwnedPointer`. The `OwnedPointer` allocates memory and moves or copies the value into the reserved memory.
```mojo
o_ptr = OwnedPointer(some_big_struct)
```

An owned pointer can hold almost any type of item, but the stored item must be `Movable`, `Copyable`, or `ExplicitlyCopyable`. Since an `OwnedPointer` is designed to enforce single ownership, the pointer itself can be moved, but not copied.

Note: Currently, you can't create an `Optional[OwnedPointer[T]]` because the `Optional` type only works with types that are both movable and copyable. This restricts some use cases that would otherwise be a natural fit for `OwnedPointer`, including self-referential data structures like linked lists and trees. (Until this use case is supported for `OwnedPointer`, it's possible to use `ArcPointer` where you need a smart pointer that can be `Optional`.)

## `ArcPointer`

An [`ArcPointer`](/mojo/stdlib/memory/arc/ArcPointer) is a reference-counted smart pointer, ideal for shared resources where the last owner for a given value may not be clear. Like an `OwnedPointer`, it points to a single value, and it allocates memory when you initialize the `ArcPointer` with a value:

```mojo
attributesDict: Dict[String, String] = {}
attributes = ArcPointer(attributesDict)
```

Unlike an `OwnedPointer`, an `ArcPointer` can be freely copied. All instances of a given `ArcPointer` share a reference count, which is incremented whenever the `ArcPointer` is copied and decremented whenever an instance is destroyed. When the reference count reaches zero, the stored value is destroyed and the allocated memory is freed.

You can use `ArcPointer` to implement safe reference-semantic types. For example, in the following code snippet `SharedDict` uses an `ArcPointer` to store a dictionary. Copying an instance of `SharedDict` only copies the `ArcPointer`, not the dictionary, which is shared between all of the copies.

```mojo
from memory import ArcPointer

struct SharedDict:
    var attributes: ArcPointer[Dict[String, String]]

    fn __init__(out self):
        attributesDict: Dict[String, String] = {}
        self.attributes = ArcPointer(attributesDict)

    fn __copyinit__(out self, other: Self):
        self.attributes = other.attributes

    def __setitem__(mut self, key: String, value: String):
        self.attributes[][key] = value

    def __getitem__(self, key: String) -> String:
        return self.attributes[].get(key, default="")

def main():
    thing1 = SharedDict()
    thing2 = thing1
    thing1["Flip"] = "Flop"
    print(thing2["Flip"])
```

Note: The reference count is stored using an [`Atomic`](/mojo/stdlib/os/atomic/Atomic) value to ensure that updates to the reference count are thread-safe. However, Mojo doesn't currently enforce exclusive access across thread boundaries, so it's possible to form race conditions.

## `UnsafePointer`

[`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) is a low-level pointer that can access a block of contiguous memory locations, which might be uninitialized. It's analogous to a raw pointer in the C and C++ programming languages. `UnsafePointer` provides unsafe methods for initializing and destroying stored values, as well as for accessing the values once they're initialized.

As the name suggests, `UnsafePointer` doesn't provide any memory safety guarantees, so you should reserve it for cases when none of the other pointer types will do the job. Here are some use cases where you might want to use an `UnsafePointer`:

- Building a high-performance array-like structure, such as `List` or `Tensor`. A single `UnsafePointer` can access many values, and gives you a lot of control over how you allocate, use, and deallocate memory.
Being able to access uninitialized memory means that you can preallocate a block of memory, and initialize values incrementally as they are added to the collection.

- Interacting with external libraries, including C++ and Python. You can use `UnsafePointer` to pass a buffer full of data to or from an external library.

For more information, see [Unsafe pointers](/mojo/manual/pointers/unsafe-pointers).

---

## Intro to value lifecycle

So far, we've explained how Mojo allows you to build high-performance code that is memory safe *without* manually managing memory, using Mojo's [ownership model](/mojo/manual/values/ownership). However, Mojo is designed for [systems programming](https://en.wikipedia.org/wiki/Systems_programming), which often requires manual memory management for custom data types. So, Mojo lets you do that as you see fit. To be clear, Mojo has no reference counter and no garbage collector.

Mojo also has no built-in data types with special privileges. All data types in the standard library (such as [`Bool`](/mojo/stdlib/builtin/bool/Bool), [`Int`](/mojo/stdlib/builtin/int/Int), and [`String`](/mojo/stdlib/collections/string/string/String)) are implemented as [structs](/mojo/manual/structs).

What's great about the Mojo language is that it provides you these low-level tools for systems programming, but within a framework that helps you build things that are safe and easy to use from higher-level programs. That is, you can get under the hood and write all the "unsafe" code you want, but as long as you do so in accordance with Mojo's [value semantics](/mojo/manual/values/value-semantics), the programmer instantiating your type/object doesn't need to think about memory management at all, and the behavior will be safe and predictable, thanks to [value ownership](/mojo/manual/values/ownership).

In summary, it's the responsibility of the type author to manage the memory and resources for each value type, by implementing specific lifecycle methods, such as the constructor, copy constructor, move constructor, and destructor, as necessary. Mojo doesn't create any constructors by default, although it does add a trivial, no-op destructor for types that don't define their own.

In the following pages, we'll explain exactly how to define these lifecycle methods in accordance with value semantics so your types play nicely with value ownership.

## Lifecycles and lifetimes

First, let's clarify some terminology:

* The "lifecycle" of a value is defined by various [dunder methods](/mojo/manual/structs#special-methods) in a struct. Each lifecycle event is handled by a different method, such as the constructor (`__init__()`), the destructor (`__del__()`), the copy constructor (`__copyinit__()`), and the move constructor (`__moveinit__()`). All values that are declared with the same type have the same lifecycle.
* The "lifetime" of a variable is defined by the span of time during program execution in which the variable is considered valid. The life of a variable begins when its value is initialized (via `__init__()`, `__copyinit__()` or `__moveinit__()`) and ends when the value is destroyed (`__del__()`), or consumed in some other way (for example, as part of a `__moveinit__()` call).

No two values have the exact same lifetime, because every value is created and destroyed at a different point in time (even if the difference is imperceptible).

:::note Origin type

The concept of lifetimes is related to the `origin` type, a Mojo primitive used to track ownership.
For most Mojo programming, you won't need to work with `origin` values directly. For more information, see [Lifetimes, origins and references](/mojo/manual/values/lifetimes).

:::

The life of a value in Mojo begins when a variable is initialized and continues up until the value is last used, at which point Mojo destroys it. Mojo destroys every value/object as soon as it's no longer used, using an “as soon as possible” (ASAP) destruction policy that runs after every sub-expression. The Mojo compiler takes care of releasing resources after last use when needed.

As you might imagine, keeping track of a value's life can be difficult if a value is shared across functions many times during the life of a program. However, Mojo makes this predictable partly through its [value semantics](/mojo/manual/values/value-semantics) and [value ownership](/mojo/manual/values/ownership) (both prerequisite readings for the following sections). The final piece of the puzzle for lifetime management is the value lifecycle: every value (defined in a struct) needs to implement key lifecycle methods that define how a value is created and destroyed.

---

## Intro to value ownership

A program is nothing without data, and all modern programming languages store data in one of two places: the call stack and the heap (also sometimes in CPU registers, but we won't get into that here). However, each language reads and writes data a bit differently—sometimes very differently. So in the following sections, we'll explain how Mojo manages memory in your programs and how this affects the way you write Mojo code.

:::note

For an alternate introduction to ownership in Mojo, check out our two-part blog post: [What ownership is really about: a mental model approach](https://www.modular.com/blog/what-ownership-is-really-about-a-mental-model-approach), and [Deep dive into ownership in Mojo](https://www.modular.com/blog/deep-dive-into-ownership-in-mojo).

:::

## Stack and heap overview

In general, all modern programming languages divide a running program's memory into four segments:

* Text. The compiled program.
* Data. Global data, either initialized or uninitialized.
* Stack. Local data, automatically managed during the program's runtime.
* Heap. Dynamically-allocated data, managed by the programmer.

The text and data segments are statically sized, but the stack and heap change size as the program runs.

The *stack* stores data local to the current function. When a function is called, the program allocates a block of memory—a *stack frame*—that is exactly the size required to store the function's data, including any *fixed-size* local variables. When another function is called, a new stack frame is pushed onto the top of the stack. When a function is done, its stack frame is popped off the stack.

Notice that we said only "*fixed-size* local variables" are stored in the stack. Dynamically-sized values that can change in size at runtime are instead stored in the heap, which is a much larger region of memory that allows for dynamic memory allocation. Technically, a local variable for such a value is still stored in the call stack, but its value is a fixed-size pointer to the real value on the heap. Consider a Mojo string: it can be any length, and its length can change at runtime. So the Mojo `String` struct includes some statically-sized fields, plus a pointer to a dynamically-allocated buffer holding the actual string data.
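As a small illustration of that split, in the sketch below the `String` variable itself is a fixed-size value in the current stack frame, while the character data it manages lives on the heap:

```mojo
fn main():
    # `s` itself is a fixed-size struct in this function's stack frame;
    # the actual character data lives in a heap-allocated buffer.
    var s = String("Hello")
    s += ", world!"  # growing the string may reallocate the heap buffer
    print(s)
```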
Another important difference between the heap and the stack is that the stack is managed automatically—the code to push and pop stack frames is added by the compiler. Heap memory, on the other hand, is managed by the programmer explicitly allocating and deallocating memory. You may do this indirectly—by using standard library types like `List` and `String`—or directly, using the [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) API.

Values that need to outlive the lifetime of a function (such as an array that's passed between functions and should not be copied) are stored in the heap, because heap memory is accessible from anywhere in the call stack, even after the function that created it is removed from the stack. This sort of situation—in which a heap-allocated value is used by multiple functions—is where most memory errors occur, and it's where memory management strategies vary the most between programming languages.

## Memory management strategies

Because memory is limited, it's important that programs remove unused data from the heap ("free" the memory) as quickly as possible. Figuring out when to free that memory is pretty complicated.

Some programming languages try to hide the complexities of memory management from you by utilizing a "garbage collector" process that tracks all memory usage and deallocates unused heap memory periodically (also known as automatic memory management). A significant benefit of this method is that it relieves developers of the burden of manual memory management, avoiding many errors and making developers more productive. However, it incurs a performance cost because the garbage collector interrupts the program's execution, and it might not reclaim memory very quickly.

Other languages require that you manually free data that's allocated on the heap. When done properly, this makes programs execute quickly, because there's no processing time consumed by a garbage collector. However, the challenge with this approach is that programmers make mistakes, especially when multiple parts of the program need access to the same memory—it becomes difficult to know which part of the program "owns" the data and must deallocate it. Programmers might accidentally deallocate data before the program is done with it (causing "use-after-free" errors), or they might deallocate it twice ("double free" errors), or they might never deallocate it ("leaked memory" errors). Mistakes like these and others can have catastrophic results for the program, and these bugs are often hard to track down, making it especially important that they don't occur in the first place.

Mojo uses a third approach called "ownership" that relies on a collection of rules that programmers must follow when passing values. The rules ensure there is only one "owner" for a given value at a time. When a value's lifetime ends, Mojo calls its destructor, which is responsible for deallocating any heap memory that needs to be deallocated. In this way, Mojo helps ensure memory is freed, but it does so in a way that's deterministic and safe from errors such as use-after-free, double-free, and memory leaks. Plus, it does so with a very low performance overhead.

Mojo's value ownership model provides an excellent balance of programming productivity and strong memory safety. It only requires that you learn some new syntax and a few rules about how to share access to memory within your program.
But before we explain the rules and syntax for Mojo's value ownership model, you first need to understand [value semantics](/mojo/manual/values/value-semantics).

---

## Introduction to layouts

Mojo’s [`layout` package](/mojo/kernels/layout/) provides a number of APIs for working with dense multidimensional arrays, which simplify writing algorithms for linear algebra. This package includes the following main types:

- The [`Layout`](/mojo/kernels/layout/layout/Layout) struct describes an arrangement of data in memory. A *layout* is a function that maps a set of logical coordinates (like (*x*, *y*) in a two-dimensional array) to a linear index value. Layouts can be hierarchical (for example, representing a 2D matrix that’s further subdivided into tiles).
- [`LayoutTensor`](/mojo/kernels/layout/layout_tensor/LayoutTensor) is a flexible tensor type that combines a `Layout` and a pointer to data.
- The [`IntTuple`](/mojo/kernels/layout/int_tuple/IntTuple) struct is a hierarchical tuple type, where each element of the tuple can either be an integral value or a nested `IntTuple`. The `IntTuple` type is used extensively for defining and indexing layouts and layout tensors.

:::tip Example code

You can find most of the code examples on this page in the [public GitHub repo](https://github.com/modular/modular/tree/main/examples/mojo/layouts). Some of the concepts presented here can be a little hard to grasp from static examples, so we recommend downloading the example code and experimenting.

:::

## What’s a Layout?

A layout is a function that maps a set of logical coordinates to a single linear index value. For example, a layout could describe a 2x4 row-major matrix, or a 6x6 column-major matrix.

```mojo
from layout import Layout, print_layout

var l2x4row_major = Layout.row_major(2, 4)
var l6x6col_major = Layout.col_major(6, 6)
```

Layouts are made up of two tuples: shape and stride, where shape describes the logical coordinate space and the stride determines the mapping to the linear index value. A layout can be written as (*shape*:*stride*). For example, a contiguous vector of length 4 can be represented as (4:1):

![](../images/layout/1d-layout-with-strides.png#light) ![](../images/layout/1d-layout-with-strides-dark.png#dark)

Figure 1. 1D layout (4:1)

A 3x4 row-major layout can be represented as ((3, 4):(4, 1)). That is, the *shape* is 3x4 and the *strides* are 4 and 1. You can break this down into two sub-layouts or *modes*: a row mode and a column mode: 3 rows with a stride of 4 (3:4, the first numbers from each tuple) and 4 columns with a stride of 1 (4:1, the second numbers from each tuple).

The [`print_layout()`](/mojo/kernels/layout/layout/print_layout) function generates an ASCII diagram of any 2D layout, showing the coordinates on the outside and the corresponding index values in the grid.

```mojo
var l3x4row_major = Layout.row_major(3, 4)
print_layout(l3x4row_major)
```

Output:

```plaintext
((3, 4):(4, 1))
       0    1    2    3
    +----+----+----+----+
 0  |  0 |  1 |  2 |  3 |
    +----+----+----+----+
 1  |  4 |  5 |  6 |  7 |
    +----+----+----+----+
 2  |  8 |  9 | 10 | 11 |
    +----+----+----+----+
```

The coordinate to index mapping is performed by calculating the dot product of the logical coordinates and the corresponding strides. For example, given the coordinates (*i, j*) and the layout shown above, the index value is $i*4 + j*1$. So coordinate (1, 1) maps to 5, as shown in the diagram.

The following example shows how to use a `Layout` to convert between coordinates and index values.
```mojo
var coords = IntTuple(1, 1)
var idx = l3x4row_major(coords)
print("index at coordinates (1, 1): ", idx)
print("coordinates at index 7:", l3x4row_major.idx2crd(7))
```

Output:

```plaintext
index at coordinates (1, 1): 5
coordinates at index 7: (1, 3)
```

As this example shows, the layout is a function that takes a set of integer coordinates and returns a single integer (the linear index). The `Layout` struct also provides an [`idx2crd()`](/mojo/kernels/layout/layout/Layout#idx2crd) method that transforms a linear index into a set of logical coordinates.

:::note Printing layouts

You can use `print_layout()` to print a diagram of any 2D layout. You can pass *any* layout to the built-in `print()` function to print a string representation of the layout in the form of a (*shape*:*stride*) pair.

:::

### IntTuple: representing hierarchical shapes and strides

A layout’s shape and stride are represented using the [`IntTuple`](/mojo/kernels/layout/int_tuple/IntTuple) type. Each element of an `IntTuple` is either an integer value or a nested `IntTuple`. You can create nested `IntTuple`s using the `IntTuple` constructor:

```mojo
var shape1 = IntTuple(4, IntTuple(2, 2))
```

A layout’s shape and stride tuples must be *congruent*—that is, they need to have the same hierarchical structure: the tuples must have the same number of elements, and any elements that are nested tuples must also have the same number of elements.

The [`int_tuple`](/mojo/kernels/layout/int_tuple/) package provides a number of functions for working with `IntTuple`. For example, it provides a [`congruent()`](/mojo/kernels/layout/int_tuple/congruent) function for testing the congruency of two tuples.

### Modes

A layout has one or more *modes*, where a mode is a shape:stride pair. For example, the 1D vector layout (8:1) has a single mode: 8 elements with a stride of 1:

![](../images/layout/1d-layout.png#light) ![](../images/layout/1d-layout-dark.png#dark)

Figure 2. 1D layout

The 2D row-major matrix layout ((2, 4):(4, 1)) has two modes, 2:4 (the first numbers from each tuple) and 4:1 (the second numbers from each tuple). Taking them right to left, the second mode describes 4 columns with a stride of one. The first mode specifies that there are two of these groups with a stride of 4:

![](../images/layout/2d-layout-with-strides.png#light) ![](../images/layout/2d-layout-with-strides-dark.png#dark)

Figure 3. 2D layout with strides

In a column-major layout, the row number varies the fastest, so a column-major 2x4 matrix has the layout ((2, 4):(1, 2)) and looks like this:

![](../images/layout/2d-col-major-layout-with-strides.png#light) ![](../images/layout/2d-col-major-layout-with-strides-dark.png#dark)

Figure 4. 2D column-major layout with strides

A layout’s *rank* is the number of modes in its shape. A rank-1 (or 1D) layout describes a vector. A rank-2 layout describes a 2D matrix, and so on.

A layout’s *size* is defined as the product of all of the modes in the layout’s shape. To put it another way, it’s the number of elements that the layout addresses: that is, the *domain* of the layout function.

Modes can also be nested to represent more complicated strides along a dimension. For example, the layout (8:1) represents a 1D vector of 8 elements.

![](../images/layout/1d-layout.png#light) ![](../images/layout/1d-layout-dark.png#dark)

Figure 5. 1D vector layout

The layout (((4, 2):(1, 4))) is *also* a 1D vector of 8 elements. The extra set of parentheses indicates a nested or hierarchical mode.
Instead of being represented by a single mode like 8:1, this layout’s single dimension is represented by the multi-mode (4, 2):(1, 4):

![](../images/layout/1d-multi-modal-layout.png#light) ![](../images/layout/1d-multi-modal-layout-dark.png#dark)

Figure 6. 1D layout with nested modes

Note that in the nested modes, there’s no notion of row and column. You can think of the first mode as the “inner” mode (defining a group) and the next mode as an “outer” mode (defining a repeat of the group) as shown above.

A set of nested modes (a *multi-mode*) counts as a single mode when considering the parent layout’s rank. For example, the layouts (8:1) and (((4, 2):(1, 4))) are both rank-1 layouts.

This gets more interesting when we move to two dimensions. Consider the following 2D layouts:

![](../images/layout/multi-modal-layout.png#light) ![](../images/layout/multi-modal-layout-dark.png#dark)

Figure 7. Two 2D layouts

Layouts A and B are both 2D matrix layouts with the same overall 2D shape, but with the elements in a different order. Layout B is *tiled*, so instead of being in row-major or column-major order, four consecutive indices are grouped into each 2x2 tile. This is sometimes called *tile-major order*.

We can break this tiled layout into two modes, one for the rows and one for the columns:

- Layout B has a row mode of (2, 2):(1, 4). We can further break this into two sub-modes: the inner mode, 2:1, defines a group of two rows with a stride of one. The outer mode, 2:4, specifies that the group occurs twice with a stride of 4.
- The column mode is (2, 2):(2, 8). Once again we can break this into two sub-modes: (2:2) defines a group of two columns with a stride of two, and the group occurs twice with a stride of 8 (2:8).

If all of those modes are swimming before your eyes, take a moment to study the figure and trace out the strides yourself.

### Coordinates

Coordinates for layouts can be written in the same format as the shape tuple. For example, coordinates for layout B above can be written ((*i, j*), (*k, l*)). However, this layout can also be addressed as a logical 2D matrix, just like layout A. So ((0, 1), (0, 1)) and (2, 2) are both valid coordinates that map to the same index. In fact, this is true for any layout: the layout can be addressed with 1D or 2D coordinates as well as its “natural” coordinates.

When mapping coordinates, the dimensions are traversed in *colexicographical* order (that is, a generalized column-major order, where the leftmost coordinate varies fastest). Table 1 shows how different 1D and 2D coordinates map to the “natural” coordinates of the ((2, 2), (2, 2)) shape shown above:

| 1D | 2D | Natural |
| ----- | :---- | :---- |
| 0 | (0, 0) | ((0, 0), (0, 0)) |
| 1 | (1, 0) | ((1, 0), (0, 0)) |
| 2 | (2, 0) | ((0, 1), (0, 0)) |
| 3 | (3, 0) | ((1, 1), (0, 0)) |
| 4 | (0, 1) | ((0, 0), (1, 0)) |
| 5 | (1, 1) | ((1, 0), (1, 0)) |
| 6 | (2, 1) | ((0, 1), (1, 0)) |
| 7 | (3, 1) | ((1, 1), (1, 0)) |
| 8 | (0, 2) | ((0, 0), (0, 1)) |
| ... | ... | ... |
| 15 | (3, 3) | ((1, 1), (1, 1)) |

Table 1. Mapping between 1D, 2D, and natural coordinates

## Making layouts

There are multiple ways to create layouts. The [`row_major()`](/mojo/kernels/layout/layout/Layout/#row_major) and [`col_major()`](/mojo/kernels/layout/layout/Layout/#col_major) static methods are probably the simplest ways to create a layout.

The `row_major()` method creates a generalized row-major layout: that is, the rightmost coordinate varies the fastest.
The `col_major()` method creates a generalized column-major layout, where the leftmost coordinate varies the fastest.

```mojo
print(Layout.row_major(4, 4, 4))
print(Layout.col_major(4, 4, 4))
```

Output:

```plaintext
((4, 4, 4):(16, 4, 1))
((4, 4, 4):(1, 4, 16))
```

If you know the shape and strides in advance, you can construct an arbitrarily complex layout using the `Layout` constructor. For example:

```mojo
var tiled_layout = Layout(
    IntTuple(IntTuple(3, 2), IntTuple(2, 5)),  # shape
    IntTuple(IntTuple(1, 6), IntTuple(3, 12))  # strides
)
print_layout(tiled_layout)
```

Output:

```plaintext
(((3, 2), (2, 5)):((1, 6), (3, 12)))
       0    1    2    3    4    5    6    7    8    9
    +----+----+----+----+----+----+----+----+----+----+
 0  |  0 |  3 | 12 | 15 | 24 | 27 | 36 | 39 | 48 | 51 |
    +----+----+----+----+----+----+----+----+----+----+
 1  |  1 |  4 | 13 | 16 | 25 | 28 | 37 | 40 | 49 | 52 |
    +----+----+----+----+----+----+----+----+----+----+
 2  |  2 |  5 | 14 | 17 | 26 | 29 | 38 | 41 | 50 | 53 |
    +----+----+----+----+----+----+----+----+----+----+
 3  |  6 |  9 | 18 | 21 | 30 | 33 | 42 | 45 | 54 | 57 |
    +----+----+----+----+----+----+----+----+----+----+
 4  |  7 | 10 | 19 | 22 | 31 | 34 | 43 | 46 | 55 | 58 |
    +----+----+----+----+----+----+----+----+----+----+
 5  |  8 | 11 | 20 | 23 | 32 | 35 | 44 | 47 | 56 | 59 |
    +----+----+----+----+----+----+----+----+----+----+
```

The result is a 6x10 tile-major layout. The layout is indexed vertically in 2 groups of 3 rows ((3, 2):(1, 6)) and horizontally in 5 groups of 2 columns ((2, 5):(3, 12)). Alternatively, you can think of this as a layout consisting of 3x2 column-major tiles ((3, 2):(1, 3)) that are arranged into two rows of 5, ((2, 5):(6, 12)).

The `Layout` constructor works fine if you know the shape and strides in advance, but calculating the strides for a complicated layout isn’t always intuitive. An easier way to generate this layout is the [`tile_to_shape()`](/mojo/kernels/layout/layout/tile_to_shape) function. This takes a layout (representing the tile) and a final shape to tile to:

```mojo
var tts = tile_to_shape(Layout.col_major(3, 2), IntTuple(6, 10))
print_layout(tts)
```

Output:

```plaintext
(((3, 2), (2, 5)):((1, 6), (3, 12)))
       0    1    2    3    4    5    6    7    8    9
    +----+----+----+----+----+----+----+----+----+----+
 0  |  0 |  3 | 12 | 15 | 24 | 27 | 36 | 39 | 48 | 51 |
    +----+----+----+----+----+----+----+----+----+----+
 1  |  1 |  4 | 13 | 16 | 25 | 28 | 37 | 40 | 49 | 52 |
    +----+----+----+----+----+----+----+----+----+----+
 2  |  2 |  5 | 14 | 17 | 26 | 29 | 38 | 41 | 50 | 53 |
    +----+----+----+----+----+----+----+----+----+----+
 3  |  6 |  9 | 18 | 21 | 30 | 33 | 42 | 45 | 54 | 57 |
    +----+----+----+----+----+----+----+----+----+----+
 4  |  7 | 10 | 19 | 22 | 31 | 34 | 43 | 46 | 55 | 58 |
    +----+----+----+----+----+----+----+----+----+----+
 5  |  8 | 11 | 20 | 23 | 32 | 35 | 44 | 47 | 56 | 59 |
    +----+----+----+----+----+----+----+----+----+----+
```

A variation on `tile_to_shape()` is the [`blocked_product()`](/mojo/kernels/layout/layout/blocked_product) function. The main difference is that where `tile_to_shape()` takes an output *shape*, `blocked_product()` takes a *tiler* layout: essentially, every element in the tiler layout is replaced by a tile. The following example generates the same tiled layout using `blocked_product()`. It also prints out the two input layouts.
```mojo
# Define a 3x2 column-major tile
var tile = Layout.col_major(3, 2)
# Define a 2x5 tiler
var tiler = Layout.col_major(2, 5)
var blocked = blocked_product(tile, tiler)

print("Tile:")
print_layout(tile)
print("\nTiler:")
print_layout(tiler)
print("\nTiled layout:")
print(blocked)
```

Output:

```plaintext
Tile:
((3, 2):(1, 3))
      0   1
   +---+---+
 0 | 0 | 3 |
   +---+---+
 1 | 1 | 4 |
   +---+---+
 2 | 2 | 5 |
   +---+---+

Tiler:
((2, 5):(1, 2))
       0    1    2    3    4
    +----+----+----+----+----+
 0  |  0 |  2 |  4 |  6 |  8 |
    +----+----+----+----+----+
 1  |  1 |  3 |  5 |  7 |  9 |
    +----+----+----+----+----+

Tiled layout:
(((3, 2), (2, 5)):((1, 6), (3, 12)))
```

As you can see, `blocked_product()` combines two simple layouts to generate a more complex one.

Finally, if you know the *shape* you want and the *order* in which you want to iterate through the dimensions, you can use the [`make_ordered_layout()`](/mojo/kernels/layout/layout/make_ordered_layout) function. For example, the following example is yet another way to generate the previous tiled layout:

```mojo
var ordered = make_ordered_layout(
    IntTuple(IntTuple(3, 2), IntTuple(2, 5)),  # shape
    IntTuple(IntTuple(0, 2), IntTuple(1, 3))   # order
)
print(ordered)
```

Output:

```plaintext
(((3, 2), (2, 5)):((1, 6), (3, 12)))
```

The generated layout's strides follow the same ordering as `order`—that is, the dimension with the smallest corresponding `order` value has the smallest stride value, and so on. The strides are computed such that the layout is dense—that is, the logical multidimensional array is contiguous.

## Non-contiguous layouts

All of the examples so far have been dense layouts, where all of the elements are contiguous in memory. However, layouts can also describe sparse logical arrays. For example, a (4:2) layout is a sparse 1D array:

![](../images/layout/1d-sparse-layout.png#light) ![](../images/layout/1d-sparse-layout-dark.png#dark)

Figure 8. 1D sparse layout (4:2)

A layout’s *cosize* is the size of the layout’s codomain, which you can think of as the size of the smallest contiguous array that can contain all of the layout’s elements. The cosize is the largest linear index value generated by the layout plus 1. So in the example in Figure 8, the layout has a size of 4, but a cosize of 7.

---

## IntTuple

`struct IntTuple[origin: ImmutableOrigin = {}]`

A hierarchical, nested tuple of integers with efficient memory management. IntTuple provides a flexible data structure for representing multi-dimensional shapes, indices, and other nested integer collections. It supports both flat and hierarchical representations with efficient memory sharing.

This structure is fundamental for tensor operations, layout specifications, and dimension handling in high-performance computing contexts.

## Parameters

* origin (`ImmutableOrigin`): Origin tracking for memory safety. Defaults to the current origin.

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `Intable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `MinimumValue`

`alias MinimumValue = -65534`

Minimum allowed value for integers in an `IntTuple`. This constant defines the lower bound for integer values that can be stored directly in an `IntTuple`. Values below this threshold are reserved for internal use to represent structural information like sub-tuple offsets.

## Methods

### `__init__`

`__init__(out self)`

Initialize an empty IntTuple. Creates an `IntTuple` with zero elements, which can be used as a starting point for building tuples incrementally with `append` or `extend`.
Performance:

* Minimal allocation (just a single element for length).
* Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled.

`__init__(out self, *, num_elems: Int)`

Initialize an `IntTuple` with a specified number of uninitialized elements. Creates an `IntTuple` with space for the specified number of elements, but does not initialize the elements themselves.

Note: Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled.

**Args:**

* num\_elems (`Int`): The number of elements to allocate space for.

`@implicit`

`__init__(out self, *elements: Int)`

Initialize an `IntTuple` with a variadic list of integers. Creates an `IntTuple` containing the provided integer values. This constructor is implicit, allowing direct conversion from integer lists.

**Args:**

* \*elements (`Int`): Variable number of integer values to store in the tuple.

`__init__(out self, elements: VariadicList[Int])`

Initialize an `IntTuple` with a list of integers. Creates an `IntTuple` containing the provided integer values. This constructor is implicit, allowing direct conversion from integer lists.

Notes:

* Pre-allocates exact memory needed for efficiency.
* Validates that all values are above `MinimumValue`. If any value is less than `MinimumValue`, aborts with an error message.
* Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled.

**Args:**

* elements (`VariadicList[Int]`): List of integer values to store in the tuple.

`@implicit`

`__init__(out self, value: Int)`

Initialize an `IntTuple` with a single integer value. Creates an `IntTuple` containing a single integer element.

**Args:**

* value (`Int`): The integer value to store in the tuple.

`__init__(out self, *elements: IntTuple[origin], *, __list_literal__: Tuple[] = Tuple())`

Initialize an `IntTuple` with nested IntTuples. Creates a hierarchical `IntTuple` containing the provided `IntTuple` elements, preserving their nested structure.

**Args:**

* \*elements (`IntTuple[origin]`): Variable number of `IntTuple` values to store in the tuple.
* `__list_literal__` (`Tuple[]`): Specifies that this constructor can be used for list literals.

`__init__(out self, *, non_owned: IntArray)`

Initialize an `IntTuple` with a non-owned `IntArray`. Creates an `IntTuple` that uses the provided `IntArray` as its storage without taking ownership. This allows creating views into existing `IntTuple` data without copying.

**Args:**

* non\_owned (`IntArray`): The `IntArray` to use as storage without taking ownership.

`__init__(out self, existing: Self, rng: _StridedRange)`

Initialize an `IntTuple` as a slice of an existing `IntTuple`. Creates a new `IntTuple` containing only the elements from the existing `IntTuple` that are specified by the range.

Notes:

* Preserves nested structure of elements in the slice.
* Structure validation only performed when `INT_TUPLE_VALIDATION` is enabled.

**Args:**

* existing (`Self`): The source `IntTuple` to slice from.
* rng (`_StridedRange`): The range of indices to include in the new `IntTuple`.

`__init__(out self, dimlist: DimList)`

Initialize an `IntTuple` from a DimList. Creates an `IntTuple` containing the dimensions from a DimList, handling both defined and undefined dimensions appropriately.

Notes:

* Converts undefined dimensions to `UNKNOWN_VALUE`.
* Validates that all values are above `MinimumValue`. If any value is less than `MinimumValue`, aborts with an error message.

**Args:**

* dimlist (`DimList`): The DimList containing dimension information.
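As a quick sketch of a few of these constructor forms in use (a minimal illustration, assuming the `layout` package import used elsewhere in these docs; the printed formatting is indicative only):

```mojo
from layout import IntTuple

fn main():
    var flat = IntTuple(2, 3, 4)              # variadic integer constructor
    var nested = IntTuple(IntTuple(2, 2), 4)  # nested form preserves hierarchy
    var single: IntTuple = 7                  # implicit single-value conversion
    print(flat)    # (2, 3, 4)
    print(nested)  # ((2, 2), 4)
    print(single)  # 7
```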
`@implicit`

`__init__(out self, zipper: _zip[origin, 2])`

Initialize an `IntTuple` from a zip iterator. Creates an `IntTuple` by appending each element from the zip iterator. This constructor is implicit, allowing direct conversion from zip iterators.

Note: This implementation is not optimized and may be improved in future versions.

**Args:**

* zipper (`_zip[origin, 2]`): A zip iterator containing pairs of elements to append.

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Initialize by copying an existing `IntTuple`. Creates a deep copy of the provided `IntTuple`, copying all its data into newly allocated memory.

Note: There is a Mojo bug where this method unnecessarily propagates the origin of self to the new copy.

**Args:**

* existing (`Self`): The `IntTuple` to copy from.

### `__getitem__`

`__getitem__(self, _idx: Int) -> IntTuple[self]`

Retrieves an element at the specified index from the `IntTuple`. Supports negative indexing (e.g., `-1` for the last element).

Notes: If index validation is enabled and the index is out of bounds, aborts with an error message.

**Args:**

* \_idx (`Int`): The index of the element to retrieve.

**Returns:** An `IntTuple` containing either a single value or a sub-tuple.

`__getitem__(self, span: Slice) -> Self`

Retrieves a slice of elements from the `IntTuple`. Creates a new `IntTuple` containing the elements specified by the slice.

**Args:**

* span (`Slice`): A slice object specifying the range of elements to retrieve.

**Returns:** A new `IntTuple` containing the specified elements.

### `__lt__`

`__lt__(self, rhs: IntTuple[origin]) -> Bool`

Compare two `IntTuple`s lexicographically. This function performs element-wise comparison of two `IntTuple`s and determines if the first is lexicographically less than the second. It compares corresponding elements until it finds a pair where the elements differ.

Example:

```mojo
from layout.int_tuple import IntTuple

var tuple1 = IntTuple(1, 2, 3)
var tuple2 = IntTuple(1, 2, 4)
var result = tuple1 < tuple2  # True
```

**Args:**

* rhs (`IntTuple[origin]`): The other `IntTuple` to compare.

**Returns:** True if `self` is lexicographically less than `rhs`, False otherwise.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Equality operator for `IntTuple`.

**Args:**

* other (`Self`): The `IntTuple` to compare with.

**Returns:** True if the `IntTuple`s are equal, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Inequality operator for `IntTuple`.

**Args:**

* other (`Self`): The `IntTuple` to compare with.

**Returns:** True if the `IntTuple`s are not equal, False otherwise.

### `elements_size`

`static elements_size[origin: ImmutableOrigin](elements: VariadicListMem[IntTuple[origin], origin, is_owned]) -> Int`

Calculate the total storage size needed for a list of IntTuples. Computes the sum of sizes for all elements, accounting for both direct integer values and nested sub-tuples.

**Parameters:**

* origin (`ImmutableOrigin`): Origin of the elements in the `IntTuple`.

**Args:**

* elements (`VariadicListMem[IntTuple[origin], origin, is_owned]`): List of `IntTuple` elements to measure.

**Returns:** The total storage size required for all elements.

`static elements_size[origin: ImmutableOrigin, n: Int](elements: InlineArray[Pointer[IntTuple, origin], n], idx: Int) -> Int`

Calculate the total storage size needed for IntTuples at a specific index. Computes the sum of sizes for all elements at the given index in an array of `IntTuple` pointers.

**Parameters:**

* origin (`ImmutableOrigin`): Origin tracking for memory safety.
* n (`Int`): Size of the inline array.

**Args:**

* elements (`InlineArray[Pointer[IntTuple, origin], n]`): Array of pointers to `IntTuple`s.
* idx (`Int`): Index to access in each `IntTuple`.

**Returns:** The total storage size required for all elements at the specified index.

### `owned_copy`

`owned_copy(self) -> IntTuple`

Create a deep copy of this `IntTuple` with its own memory ownership. This method creates a completely independent copy of the `IntTuple` with newly allocated memory. Unlike `__copyinit__`, this method can be called on an existing instance to create a separate copy.

Example:

```mojo
from layout import IntTuple

var original = IntTuple(1, 2, 3)
var copy = original.owned_copy()
# Modifying copy will not affect original
```

**Returns:** A new `IntTuple` containing the same data as this one but with independent memory ownership.

### `replace_entry`

`replace_entry(self, idx: Int, value: IntTuple[origin]) -> IntTuple`

Replace an entry in the tuple with another `IntTuple`. Creates a new `IntTuple` with the element at the specified index replaced by the provided `IntTuple`.

Note: If the index is out of bounds and `INT_TUPLE_VALIDATION` is enabled, aborts with an error message.

**Args:**

* idx (`Int`): The index of the element to replace.
* value (`IntTuple[origin]`): The `IntTuple` to insert at the specified index.

**Returns:** A new `IntTuple` with the replacement applied.

`replace_entry(mut self, idx: Int, *, int_value: Int)`

Replace an integer value at the specified index in-place. Directly modifies the tuple by replacing the integer value at the given index. This is more efficient than creating a new tuple when only a single value needs to be changed.

Note: If the index is out of bounds and `INT_TUPLE_VALIDATION` is enabled, aborts with an error message.

**Args:**

* idx (`Int`): The index of the element to replace.
* int\_value (`Int`): The integer value to insert at the specified index.

### `count_values`

`count_values(self) -> Int`

Count the total number of integer values in this tuple hierarchy. Recursively traverses the nested tuple structure and counts all integer values. This is useful for determining the size needed for flattened representations.

Note: For a flat tuple, this will return the same value as `len(self)`. For nested tuples, it counts all leaf integer values.

**Returns:** The total count of integer values in this tuple and all nested tuples.

### `flatten`

`flatten(self) -> IntTuple`

Flatten a nested `IntTuple` into a single-level `IntTuple`. This function converts a hierarchical `IntTuple` structure into a flat sequence of integer values, preserving the order of elements.

**Returns:** A new `IntTuple` containing all integer values in a flat structure.

### `all_known`

`all_known(self) -> Bool`

Check if all values in this tuple hierarchy are known (not `UNKNOWN_VALUE`). Recursively traverses the nested tuple structure and checks if any value is equal to `UNKNOWN_VALUE`.

**Returns:** True if all values in this tuple and nested tuples are known, False if any value is `UNKNOWN_VALUE`.

### `append`

`append(mut self, *elements: IntTuple[origin])`

Append one or more `IntTuple` elements to this tuple. This method modifies the tuple in-place by adding the provided elements to the end of the tuple. It handles both value tuples and nested tuples.

Notes:

* This operation requires reallocating the underlying `IntArray` storage to accommodate the new elements, which may impact performance for large tuples.
* Aborts if called on a non-owning (sub-tuple) instance.
**Args:**

* \*elements (`IntTuple[origin]`): Variable number of `IntTuple` objects to append to this tuple.

### `extend`

`extend(mut self, tuple: IntTuple[origin])`

Extends this tuple by appending all elements from another tuple. This method modifies the tuple in-place by adding all elements from the provided tuple to the end of this tuple. It efficiently handles both value elements and nested tuples.

Notes:

* This operation requires reallocating the underlying `IntArray` storage to accommodate the new elements, which may impact performance for large tuples.
* Aborts if called on a non-owning (sub-tuple) instance.
* If the input tuple is empty, this method returns without making any changes.

**Args:**

* tuple (`IntTuple[origin]`): The `IntTuple` whose elements will be appended to this tuple.

### `size`

`size(self) -> Int`

Returns the total size of the `IntTuple` in memory. For owning tuples, returns the size of the underlying `IntArray`. For non-owning tuples, calculates the size recursively.

**Returns:** The total size in memory units.

### `tuple_size`

`static tuple_size(data: IntArray) -> Int`

Recursively calculates the size of a tuple represented by an `IntArray`. This method traverses the tuple structure, accounting for both direct values and nested sub-tuples to compute the total memory footprint.

**Args:**

* data (`IntArray`): `IntArray` containing the tuple data.

**Returns:** The total size of the tuple in memory units.

### `validate_structure`

`validate_structure(self)`

Validates the internal structure of the `IntTuple`. Ensures that the actual size of the underlying data matches the computed size based on the tuple's structure. This helps detect memory corruption or implementation errors. Aborts execution with an error message if validation fails.

### `__len__`

`__len__(self) -> Int`

Returns the number of elements in the `IntTuple`. This is the logical length of the tuple, not its memory size.

**Returns:** The number of elements in the tuple.

### `__iter__`

`__iter__(self) -> _IntTupleIter[self, origin]`

Returns an iterator over the elements of the `IntTuple`. This enables iteration through the tuple using for-loops.

**Returns:** An iterator object for this `IntTuple`.

### `is_value`

`is_value(self) -> Bool`

Determines if this `IntTuple` represents a single value rather than a tuple.

**Returns:** True if this `IntTuple` contains exactly one element that is a value, False otherwise.

`is_value(self, i: Int) -> Bool`

Determines if the element at the specified index is a value rather than a tuple.

Notes: If index validation is enabled and the index is out of bounds, aborts with an error message.

**Args:**

* i (`Int`): The index of the element to check.

**Returns:** True if the element at index i is a value, False if it's a tuple.

### `is_tuple`

`is_tuple(self) -> Bool`

Determines if this `IntTuple` represents a tuple rather than a single value.

**Returns:** True if this `IntTuple` is a tuple (not a single value), False otherwise.

`is_tuple(self, i: Int) -> Bool`

Determines if the element at the specified index is a tuple rather than a value.

Notes: This is the complement of is\_value(i).

**Args:**

* i (`Int`): The index of the element to check.

**Returns:** True if the element at index i is a tuple, False if it's a value.

### `value`

`value(self) -> Int`

Retrieves the value of this `IntTuple` if it represents a single value. This method should only be called if `is_value()` returns True.

**Returns:** The integer value stored in this `IntTuple`.
`value(self, i: Int) -> Int` Retrieves the value of the element at the specified index. This method should only be called if `is_value(i)` returns True. Notes: If the element is not a value, the behavior is undefined. **Args:** * ​i (`Int`): The index of the element to retrieve. **Returns:** The integer value stored at the specified index. ### `tuple` `tuple(ref self) -> ref [self] Self` Returns a reference to this `IntTuple` as a tuple. Notes: This method is used to access the current `IntTuple` as a tuple without creating a copy of the data. **Returns:** A reference to this `IntTuple` to avoid unnecessary copying. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes a string representation of this `IntTuple` to the provided writer. Notes: For single values, writes just the value. For tuples, writes a comma-separated list of elements enclosed in parentheses. **Parameters:** * ​W (`Writer`): A type that conforms to the Writer trait. **Args:** * ​writer (`W`): The writer to output the string representation to. ### `__str__` `__str__(self) -> String` Returns a string representation of this `IntTuple`. **Returns:** A string representation of the `IntTuple`, using the `write_to` method. ### `is_equal` `static is_equal(a: IntTuple[origin], b: IntTuple[origin]) -> Bool` Compares two `IntTuple`s for equality. Notes: Handles nested tuples and special cases where a single-element tuple is equivalent to its contained value. **Args:** * ​a (`IntTuple[origin]`): The first `IntTuple` to compare. * ​b (`IntTuple[origin]`): The second `IntTuple` to compare. **Returns:** True if the `IntTuple`s are equal in structure and values, False otherwise. ### `__repr__` `__repr__(self) -> String` Returns a string representation of this `IntTuple` for debugging. **Returns:** A string representation of the `IntTuple`, same as `__str__`. ### `__int__` `__int__(self) -> Int` Converts this `IntTuple` to an integer. This method should only be called if `is_value()` returns True. Notes: If the `IntTuple` is not a single value, the behavior is undefined. **Returns:** The integer value stored in this `IntTuple`. --- ## io Provides utilities for working with input/output. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`input`](/mojo/stdlib/builtin/io/input): Reads a line of input from the user. * [​`print`](/mojo/stdlib/builtin/io/print): Prints elements to the text stream. Each element is separated by `sep` and followed by `end`. --- ## IO `@register_passable(trivial)` `struct IO` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `FusedInput` `alias FusedInput = IO(2)` ### `FusedOutput` `alias FusedOutput = IO(3)` ### `Input` `alias Input = IO(1)` ### `Output` `alias Output = IO(0)` ### `Unknown` `alias Unknown = IO(-1)` ## Methods ### `__init__` `__init__(value: Int) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## io_spec ## Aliases ### `FusedInput` `alias FusedInput = IOSpec()` ### `FusedOutput` `alias FusedOutput = IOSpec()` ### `Input` `alias Input = IOSpec()` ### `IOUnknown` `alias IOUnknown = IOSpec()` ### `MutableInput` `alias MutableInput = IOSpec()` ### `Output` `alias Output = IOSpec()` ## Structs * [​`IO`](/max/api/mojo/tensor/io_spec/IO): * [​`IOSpec`](/max/api/mojo/tensor/io_spec/IOSpec): Parameter used to encode whether a particular tensor argument to a DPS kernel is an output, input, or mutable input. 
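The `IO` tag values above are what these `io_spec` aliases wrap; a minimal, hypothetical sketch of comparing them (the import path is an assumption based on this page's location under `max/tensor/io_spec`):

```mojo
# Assumed import path; adjust to your installed package layout.
from max.tensor.io_spec import IO

fn main():
    var kind = IO.Input
    print(kind == IO.Input)   # True
    print(kind == IO.Output)  # False
```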
---

## IOSpec

`@register_passable(trivial)`

`struct IOSpec[mut: Bool, input: IO]`

Parameter used to encode whether a particular tensor argument to a DPS kernel is an output, input, or mutable input.

```mojo
Input == IOSpec[False, IO.Input]()
Output == IOSpec[True, IO.Output]()
MutableInput == IOSpec[True, IO.Input]()
FusedInput == IOSpec[False, IO.FusedInput]()
FusedOutput == IOSpec[True, IO.FusedOutput]()
```

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

---

## iota

`iota[dtype: DType, width: Int](offset: SIMD[dtype, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[dtype, width]`

Creates a SIMD vector containing an increasing sequence, starting from offset.

**Parameters:**

* ​dtype (`DType`): The `dtype` of the input and output SIMD vector.
* ​width (`Int`): The width of the input and output SIMD vector.

**Args:**

* ​offset (`SIMD[dtype, 1]`): The value to start the sequence at. Default is zero.

**Returns:**

An increasing sequence of values, starting from offset.

`iota[dtype: DType, //](buff: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], len: Int, offset: Int = 0)`

Fill the buffer with numbers ranging from offset to offset + len - 1, spaced by 1. The function doesn't return anything; the buffer is updated in place.

**Parameters:**

* ​dtype (`DType`): DType of the underlying data.

**Args:**

* ​buff (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The buffer to fill.
* ​len (`Int`): The length of the buffer to fill.
* ​offset (`Int`): The value to fill at index 0.

`iota[dtype: DType, //](mut v: List[SIMD[dtype, 1], hint_trivial_type], offset: Int = 0)`

Fill a list with consecutive numbers starting from the specified offset.

**Parameters:**

* ​dtype (`DType`): DType of the underlying data.

**Args:**

* ​v (`List[SIMD[dtype, 1], hint_trivial_type]`): The list to fill with numbers.
* ​offset (`Int`): The starting value to fill at index 0.

`iota(mut v: List[Int, hint_trivial_type], offset: Int = 0)`

Fill a list with consecutive numbers starting from the specified offset.

**Args:**

* ​v (`List[Int, hint_trivial_type]`): The list to fill with numbers.
* ​offset (`Int`): The starting value to fill at index 0.

---

## irfft

Inverse real FFT kernel using cuFFT.

## Functions

* [​`irfft`](./irfft): Compute the inverse real FFT of the input tensor.

---

## irfft

`irfft[input_rank: Int, input_type: DType, output_type: DType](input: NDBuffer[input_type, input_rank, origin], output: NDBuffer[output_type, input_rank, origin], n: Int, ctx: DeviceContext)`

Compute the inverse real FFT of the input tensor. Currently, only applies it to the last dimension.

**Args:**

* ​input (`NDBuffer[input_type, input_rank, origin]`): Complex input tensor (NDBuffer).
* ​output (`NDBuffer[output_type, input_rank, origin]`): Real output tensor (NDBuffer).
* ​n (`Int`): Output signal size.
* ​ctx (`DeviceContext`): Device context.

---

## is_32bit

`is_32bit[target: target = _current_target()]() -> Bool`

Returns True if the maximum integral value is 32 bit.

**Parameters:**

* ​target (`target`): The target architecture.

**Returns:**

True if the maximum integral value is 32 bit, False otherwise.

---

## is_64bit

`is_64bit[target: target = _current_target()]() -> Bool`

Returns True if the maximum integral value is 64 bit.

**Parameters:**

* ​target (`target`): The target architecture.

**Returns:**

True if the maximum integral value is 64 bit, False otherwise.
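Both width checks are typically combined with `@parameter if` so the branch is resolved during compilation; a minimal sketch, assuming `is_32bit` and `is_64bit` are importable from `sys.info`:

```mojo
from sys.info import is_32bit, is_64bit

fn main():
    # Resolved at compile time for the current target.
    @parameter
    if is_64bit():
        print("64-bit target")
    elif is_32bit():
        print("32-bit target")
```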
---

## is_absolute

`is_absolute[PathLike: PathLike, //](path: PathLike) -> Bool`

Return True if `path` is an absolute path name. On Unix, that means it begins with a slash.

**Parameters:**

* ​PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* ​path (`PathLike`): The path to check.

**Returns:**

Return `True` if path is an absolute path name.

---

## is_amd_gpu

`is_amd_gpu() -> Bool`

Returns True if the target triple of the compiler is `amdgcn-amd-amdhsa` and False otherwise.

**Returns:**

True if the triple target is amdgpu and False otherwise.

---

## is_apple_m1

`is_apple_m1() -> Bool`

Returns True if the host system is an Apple M1 with AMX support, otherwise returns False.

**Returns:**

True if the host system is an Apple M1 with AMX support and False otherwise.

---

## is_apple_m2

`is_apple_m2() -> Bool`

Returns True if the host system is an Apple M2 with AMX support, otherwise returns False.

**Returns:**

True if the host system is an Apple M2 with AMX support and False otherwise.

---

## is_apple_m3

`is_apple_m3() -> Bool`

Returns True if the host system is an Apple M3 with AMX support, otherwise returns False.

**Returns:**

True if the host system is an Apple M3 with AMX support and False otherwise.

---

## is_apple_m4

`is_apple_m4() -> Bool`

Returns True if the host system is an Apple M4 with AMX support, otherwise returns False.

**Returns:**

True if the host system is an Apple M4 with AMX support and False otherwise.

---

## is_apple_silicon

`is_apple_silicon() -> Bool`

Returns True if the host system is an Apple Silicon with AMX support, otherwise returns False.

**Returns:**

True if the host system is an Apple Silicon with AMX support and False otherwise.

---

## is_big_endian

`is_big_endian[target: target = _current_target()]() -> Bool`

Returns True if the host endianness is big and False otherwise.

**Parameters:**

* ​target (`target`): The target architecture.

**Returns:**

True if the host target is big endian and False otherwise.

---

## is_compile_time

`is_compile_time() -> Bool`

Returns true if the current code is executed at compile time, false otherwise.

**Returns:**

A boolean value indicating whether the code is being compiled.

---

## is_contiguous_dim

`is_contiguous_dim(layout: Layout, dim: Int) -> Bool`

Checks if a flat layout is contiguous in a specific dimension.

This function checks if a flat layout is contiguous in a specified dimension, considering both positive strides and zero strides with a single element. The latter case is necessary for coalesced layouts.

**Args:**

* ​layout (`Layout`): The layout to check.
* ​dim (`Int`): The dimension to check.

**Returns:**

True if the layout is contiguous in the specified dimension, False otherwise.

---

## is_cpu

`is_cpu[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool`

Checks if the target is a CPU (compile-time version).

**Parameters:**

* ​target (`StringSlice[$1]`): Target string to check.

**Returns:**

True if the target is a CPU, False otherwise.

`is_cpu(target: StringSlice[origin]) -> Bool`

Checks if the target is a CPU (runtime version).

**Args:**

* ​target (`StringSlice[origin]`): Target string to check.

**Returns:**

True if the target is a CPU, False otherwise.

---

## is_defined

`is_defined[name: StringSlice[StaticConstantOrigin]]() -> Bool`

Return true if the named value is defined.

**Parameters:**

* ​name (`StringSlice[StaticConstantOrigin]`): The name to test.

**Returns:**

True if the name is defined.
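For instance, `is_defined` lets code specialize on `-D` definitions passed at build time; a minimal sketch (the `ENABLE_TRACING` name is just an illustrative flag), assuming the `sys.param_env` import path:

```mojo
from sys.param_env import is_defined

fn main():
    # True when the program is built with -D ENABLE_TRACING.
    @parameter
    if is_defined["ENABLE_TRACING"]():
        print("tracing enabled at compile time")
    else:
        print("tracing disabled")
```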
---

## is_flat

`is_flat(t: IntTuple[origin]) -> Bool`

Check if an `IntTuple` is flat.

This function checks if the `IntTuple` is flat, meaning it has no nested elements.

**Args:**

* ​t (`IntTuple[origin]`): The `IntTuple` to check.

**Returns:**

True if the `IntTuple` is flat, False otherwise.

---

## is_gpu

`is_gpu[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool`

Checks if the target is a GPU (compile-time version).

**Parameters:**

* ​target (`StringSlice[$1]`): Target string to check.

**Returns:**

True if the target is a GPU, False otherwise.

`is_gpu(target: StringSlice[origin]) -> Bool`

Checks if the target is a GPU (runtime version).

**Args:**

* ​target (`StringSlice[origin]`): Target string to check.

**Returns:**

True if the target is a GPU, False otherwise.

---

## is_gpu

`is_gpu() -> Bool`

Returns True if the target triple is GPU and False otherwise.

**Returns:**

True if the triple target is GPU and False otherwise.

---

## is_int

`is_int(t: IntTuple[origin]) -> Bool`

Check if an `IntTuple` represents a single integer value.

This function determines whether the given `IntTuple` contains a single integer value rather than a nested tuple structure.

Example:

```mojo
from layout.int_tuple import is_int, IntTuple

var single_value = IntTuple(5)
var nested_tuple = IntTuple(1, 2, 3)
var result1 = is_int(single_value)  # Returns True
var result2 = is_int(nested_tuple)  # Returns False
```

**Args:**

* ​t (`IntTuple[origin]`): The `IntTuple` to check.

**Returns:**

True if the `IntTuple` contains a single integer value, False if it's a nested tuple.

---

## is_int

`is_int[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Bool`

Determines if a `RuntimeTuple` represents a scalar integer value.

This function checks if the `RuntimeTuple` holds a single scalar value rather than a tuple structure with multiple elements.

**Parameters:**

* ​t (`IntTuple[$0]`): The IntTuple type parameter of the RuntimeTuple.

**Args:**

* ​tuple (`RuntimeTuple[t, element_type=element_type]`): The `RuntimeTuple` to check.

**Returns:**

True if the `RuntimeTuple` represents a scalar integer, False otherwise.

---

## is_little_endian

`is_little_endian[target: target = _current_target()]() -> Bool`

Returns True if the host endianness is little and False otherwise.

**Parameters:**

* ​target (`target`): The target architecture.

**Returns:**

True if the host target is little endian and False otherwise.

---

## is_neoverse_n1

`is_neoverse_n1() -> Bool`

Returns True if the host system is a Neoverse N1 system, otherwise returns False.

**Returns:**

True if the host system is a Neoverse N1 system and False otherwise.

---

## is_nvidia_gpu

`is_nvidia_gpu() -> Bool`

Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` and False otherwise.

**Returns:**

True if the triple target is cuda and False otherwise.

`is_nvidia_gpu[subarch: StringSlice[StaticConstantOrigin]]() -> Bool`

Returns True if the target triple of the compiler is `nvptx64-nvidia-cuda` and we are compiling for the specified sub-architecture, and False otherwise.

**Parameters:**

* ​subarch (`StringSlice[StaticConstantOrigin]`): The subarchitecture (e.g. sm\_80).

**Returns:**

True if the triple target is cuda and False otherwise.
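These target predicates support compile-time dispatch between GPU and CPU code paths; a minimal sketch, assuming `is_nvidia_gpu` and `is_amd_gpu` are importable from `sys.info`:

```mojo
from sys.info import is_amd_gpu, is_nvidia_gpu

fn main():
    # Each branch is selected when compiling for the matching target;
    # a host build takes the final branch.
    @parameter
    if is_nvidia_gpu():
        print("compiled for an NVIDIA GPU")
    elif is_amd_gpu():
        print("compiled for an AMD GPU")
    else:
        print("compiled for a CPU target")
```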
---

## is_profiling_disabled

`is_profiling_disabled[type: TraceCategory, level: TraceLevel]() -> Bool`

Returns False if profiling is enabled for the specified type and level, and True otherwise.

**Parameters:**

* ​type (`TraceCategory`): The trace category to check.
* ​level (`TraceLevel`): The trace level to check.

**Returns:**

True if profiling is disabled for the specified type and level.

---

## is_profiling_enabled

`is_profiling_enabled[type: TraceCategory, level: TraceLevel]() -> Bool`

Returns True if profiling is enabled for the specified type and level, and False otherwise.

**Parameters:**

* ​type (`TraceCategory`): The trace category to check.
* ​level (`TraceLevel`): The trace level to check.

**Returns:**

True if profiling is enabled for the specified type and level.

---

## is_row_major

`is_row_major[rank: Int](layout: Layout) -> Bool`

Checks if a layout has row-major ordering for the specified rank. A row-major layout has strides that decrease from left to right, with the rightmost dimension having a stride of 1.

**Parameters:**

* ​rank (`Int`): The expected rank of the layout.

**Args:**

* ​layout (`Layout`): The layout to check.

**Returns:**

True if the layout has row-major ordering for the specified rank, False otherwise.

---

## is_triple

`is_triple[: string, //, name: StringLiteral[$0], target: target = _current_target()]() -> Bool`

Returns True if the target triple of the compiler matches the input and False otherwise.

**Parameters:**

* ​name (`StringLiteral[$0]`): The name of the triple value.
* ​target (`target`): The triple value to be checked against.

**Returns:**

True if the triple matches and False otherwise.

---

## is_tuple

`is_tuple(t: IntTuple[origin]) -> Bool`

Check if an `IntTuple` represents a nested tuple.

This function determines whether the given `IntTuple` contains nested elements rather than a single integer value. It is the complement of the `is_int` function.

Example:

```mojo
from layout.int_tuple import is_tuple, IntTuple

var single_value = IntTuple(5)
var nested_tuple = IntTuple(1, 2, 3)
var result1 = is_tuple(single_value)  # Returns False
var result2 = is_tuple(nested_tuple)  # Returns True
```

**Args:**

* ​t (`IntTuple[origin]`): The `IntTuple` to check.

**Returns:**

True if the `IntTuple` contains nested elements, False if it's a single integer value.

---

## is_tuple

`is_tuple[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Bool`

Determines if a `RuntimeTuple` represents a tuple rather than a scalar value.

This function checks the structure of the underlying IntTuple to determine if it represents a tuple with multiple elements or a single scalar value.

**Parameters:**

* ​t (`IntTuple[$0]`): The IntTuple type parameter of the RuntimeTuple.

**Args:**

* ​tuple (`RuntimeTuple[t, element_type=element_type]`): The `RuntimeTuple` to check.

**Returns:**

True if the `RuntimeTuple` represents a tuple, False if it represents a scalar.

---

## is_valid_target

`is_valid_target[: Bool, : Origin[$0], //, target: StringSlice[$1]]() -> Bool`

Checks if the target is valid (compile-time version).

**Parameters:**

* ​target (`StringSlice[$1]`): Target string to check.

**Returns:**

True if the target is valid (CPU or GPU), False otherwise.

`is_valid_target(target: StringSlice[origin]) -> Bool`

Checks if the target is valid (runtime version).

**Args:**

* ​target (`StringSlice[origin]`): Target string to check.

**Returns:**

True if the target is valid (CPU or GPU), False otherwise.

---

## is_x86

`is_x86() -> Bool`

Returns True if the host system architecture is X86 and False otherwise.

**Deprecated:**

Use `CompilationTarget.is_x86()` instead.

**Returns:**

True if the host system architecture is X86 and False otherwise.
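A migration sketch for the deprecation note above, assuming `CompilationTarget` is exposed by `sys.info` as the note implies:

```mojo
from sys.info import CompilationTarget

fn main():
    # Preferred replacement for the deprecated free function is_x86().
    print(CompilationTarget.is_x86())
```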
---

## isclose

`isclose[dtype: DType, width: Int](a: SIMD[dtype, width], b: SIMD[dtype, width], *, atol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0E-8), rtol: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1.0000000000000001E-5), equal_nan: Bool = False) -> SIMD[bool, width]`

Checks if the two input values are numerically within a tolerance.

When the type is integral, then equality is checked. When the type is floating point, then this checks if the two input values are numerically close using the $abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol)$ formula (a worked sketch of this check follows the `isnan` entry below).

**Parameters:**

* ​dtype (`DType`): The `dtype` of the input and output SIMD vector.
* ​width (`Int`): The width of the input and output SIMD vector.

**Args:**

* ​a (`SIMD[dtype, width]`): The first value to compare.
* ​b (`SIMD[dtype, width]`): The second value to compare.
* ​atol (`SIMD[float64, 1]`): The absolute tolerance.
* ​rtol (`SIMD[float64, 1]`): The relative tolerance.
* ​equal\_nan (`Bool`): Whether to treat NaNs as equal.

**Returns:**

A boolean vector where a and b are equal within the specified tolerance.

---

## isdir

`isdir[PathLike: PathLike, //](path: PathLike) -> Bool`

Return True if path is an existing directory. This follows symbolic links, so both islink() and isdir() can be true for the same path.

**Parameters:**

* ​PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* ​path (`PathLike`): The path to the directory.

**Returns:**

True if the path is a directory or a link to a directory and False otherwise.

---

## isfile

`isfile[PathLike: PathLike, //](path: PathLike) -> Bool`

Test whether a path is a regular file.

**Parameters:**

* ​PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* ​path (`PathLike`): The path to the directory.

**Returns:**

Returns True if the path is a regular file.

---

## isfinite

`isfinite[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]`

Checks if the value is not infinite. This is always True for non-FP data types.

**Parameters:**

* ​dtype (`DType`): The value dtype.
* ​simd\_width (`Int`): The width of the SIMD vector.

**Args:**

* ​val (`SIMD[dtype, simd_width]`): The value to check.

**Returns:**

True if val is finite and False otherwise.

---

## isinf

`isinf[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]`

Checks if the value is infinite. This is always False for non-FP data types.

**Parameters:**

* ​dtype (`DType`): The value dtype.
* ​simd\_width (`Int`): The width of the SIMD vector.

**Args:**

* ​val (`SIMD[dtype, simd_width]`): The value to check.

**Returns:**

True if val is infinite and False otherwise.

---

## islink

`islink[PathLike: PathLike, //](path: PathLike) -> Bool`

Return True if path refers to an existing directory entry that is a symbolic link.

**Parameters:**

* ​PathLike (`PathLike`): The type conforming to the os.PathLike trait.

**Args:**

* ​path (`PathLike`): The path to the directory.

**Returns:**

True if the path is a symbolic link and False otherwise.

---

## isnan

`isnan[dtype: DType, simd_width: Int](val: SIMD[dtype, simd_width]) -> SIMD[bool, simd_width]`

Checks if the value is Not a Number (NaN).

**Parameters:**

* ​dtype (`DType`): The value dtype.
* ​simd\_width (`Int`): The width of the SIMD vector.

**Args:**

* ​val (`SIMD[dtype, simd_width]`): The value to check.

**Returns:**

True if val is NaN and False otherwise.
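To make the `isclose` tolerance check documented at the top of this group concrete, here is the scalar criterion written out with the builtin `abs` and `max` (the tolerance values mirror the documented defaults):

```mojo
fn main():
    var a: Float64 = 1.0
    var b: Float64 = 1.0 + 1.0e-9

    var atol = 1.0e-8
    var rtol = 1.0e-5

    # Mirrors: abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol)
    print(abs(a - b) <= max(rtol * max(abs(a), abs(b)), atol))  # True
```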
--- ## isqrt `isqrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise reciprocal square root on a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform reciprocal square root on. **Returns:** The elementwise reciprocal square root of x. --- ## j0 `j0[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the first kind of order 0 for each input value. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector. **Returns:** A vector containing the computed value for each value in the input. --- ## j1 `j1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the Bessel function of the first kind of order 1 for each input value. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input vector. **Returns:** A vector containing the computed value for each value in the input. --- ## join `join(owned path: String, *paths: String) -> String` Join two or more pathname components, inserting '/' as needed. If any component is an absolute path, all previous path components will be discarded. An empty last part will result in a path that ends with a separator. **Args:** * ​path (`String`): The path to join. * ​\*paths (`String`): The paths to join. **Returns:** The joined path. --- ## k_matmul_ragged_paged `k_matmul_ragged_paged[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, target: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[type, 2, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape, strides], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], ctx: DeviceContextPtr)` Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape, strides]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​weight (`NDBuffer[type, 2, origin, shape, strides]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## keep `keep(val: Bool)` Provides a hint to the compiler to not optimize the variable use away. 
This is useful in benchmarking to avoid the compiler not deleting the code to be benchmarked because the variable is not used in a side-effecting manner. **Args:** * ​val (`Bool`): The value to not optimize away. `keep(val: Int)` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to avoid the compiler not deleting the code to be benchmarked because the variable is not used in a side-effecting manner. **Args:** * ​val (`Int`): The value to not optimize away. `keep[type: DType, simd_width: Int](val: SIMD[type, simd_width])` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to avoid the compiler not deleting the code to be benchmarked because the variable is not used in a side-effecting manner. **Parameters:** * ​type (`DType`): The `dtype` of the input and output SIMD vector. * ​simd\_width (`Int`): The width of the input and output SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The value to not optimize away. `keep[type: AnyType](val: UnsafePointer[type])` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to avoid the compiler not deleting the code to be benchmarked because the variable is not used in a side-effecting manner. **Parameters:** * ​type (`AnyType`): The type of the input. **Args:** * ​val (`UnsafePointer[type]`): The value to not optimize away. `keep[type: AnyTrivialRegType](mut val: type)` Provides a hint to the compiler to not optimize the variable use away. This is useful in benchmarking to avoid the compiler not deleting the code to be benchmarked because the variable is not used in a side-effecting manner. **Parameters:** * ​type (`AnyTrivialRegType`): The type of the input. **Args:** * ​val (`type`): The value to not optimize away. --- ## Kernel A kernel is a function that runs on a GPU, executing computations in parallel across a large number of [threads](thread.mdx). Kernels are a fundamental part of general-purpose GPU (GPGPU) programming and are designed to process large datasets efficiently by performing the same operation simultaneously on multiple data elements. --- ## KernelConfig `struct KernelConfig` Static configuration of the matmul inner kernel. ## Fields * ​kernel\_rows (`Int`): * ​kernel\_cols (`Int`): * ​simd\_size (`Int`): * ​packed\_shape (`DimList`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, *, kernel_rows: Int, kernel_cols: Int, simd_size: Int, packed_shape: DimList)` --- ## KernelLibrary ## `KernelLibrary` {#max.graph.KernelLibrary} > *class* max.graph.KernelLibrary(context, paths=\[]) **Parameters:** * **context** (`mlir.Context` ) * **paths** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `Path` `]` ) ### `add_path()` {#max.graph.KernelLibrary.add_path} > add\_path(path) **Parameters:** **path** ([`Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) ) ### `library_paths()` {#max.graph.KernelLibrary.library_paths} > library\_paths() **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Path*](https://docs.python.org/3/library/pathlib.html#pathlib.Path)] ### `load_paths()` {#max.graph.KernelLibrary.load_paths} > load\_paths(context, custom\_extensions) Load the custom operations from provided library paths. Performs additional “smart” library loading logic for custom operation libraries in additional formats. 
The loading logic supports the following formats:

* Compiled Mojo binary packages with .mojopkg extension
* Mojo source directory with custom operations

The loaded libraries are added to the current kernel library.

**Parameters:**

* **context** (`Context` ) – The MLIR context for loading MLIR operations
* **custom\_extensions** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`Path`](https://docs.python.org/3/library/pathlib.html#pathlib.Path) `]` ) – File paths to the custom operation libraries

### `verify_custom_op()` {#max.graph.KernelLibrary.verify_custom_op}

> verify\_custom\_op(custom\_op)

**Parameters:**

**custom\_op** (`Operation` )

---

## kernels

Helper functions for wrapping custom kv cache/attention related ops.

## `AttentionMaskVariant` {#max.nn.kernels.AttentionMaskVariant}

> *class* max.nn.kernels.AttentionMaskVariant(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)

### `CAUSAL_MASK` {#max.nn.kernels.AttentionMaskVariant.CAUSAL_MASK}

> CAUSAL\_MASK *= 'causal'*

### `CHUNKED_CAUSAL_MASK` {#max.nn.kernels.AttentionMaskVariant.CHUNKED_CAUSAL_MASK}

> CHUNKED\_CAUSAL\_MASK *= 'chunked\_causal'*

### `NULL_MASK` {#max.nn.kernels.AttentionMaskVariant.NULL_MASK}

> NULL\_MASK *= 'null'*

### `SLIDING_WINDOW_CAUSAL_MASK` {#max.nn.kernels.AttentionMaskVariant.SLIDING_WINDOW_CAUSAL_MASK}

> SLIDING\_WINDOW\_CAUSAL\_MASK *= 'sliding\_window\_causal'*

### `TENSOR_MASK` {#max.nn.kernels.AttentionMaskVariant.TENSOR_MASK}

> TENSOR\_MASK *= 'tensor\_mask'*

## `MHAMaskConfig` {#max.nn.kernels.MHAMaskConfig}

> *class* max.nn.kernels.MHAMaskConfig(attention\_mask\_variant: 'AttentionMaskVariant', positional\_encoding\_variant: 'PositionalEncodingVariant')

**Parameters:**

* **attention\_mask\_variant** ([`AttentionMaskVariant`](#max.nn.kernels.AttentionMaskVariant) )
* **positional\_encoding\_variant** ([`PositionalEncodingVariant`](#max.nn.kernels.PositionalEncodingVariant) )

### `attention_mask_variant` {#max.nn.kernels.MHAMaskConfig.attention_mask_variant}

> attention\_mask\_variant\*: [AttentionMaskVariant](#max.nn.kernels.AttentionMaskVariant)\*

### `positional_encoding_variant` {#max.nn.kernels.MHAMaskConfig.positional_encoding_variant}

> positional\_encoding\_variant\*: [PositionalEncodingVariant](#max.nn.kernels.PositionalEncodingVariant)\*

## `MHAMaskVariant` {#max.nn.kernels.MHAMaskVariant}

> *class* max.nn.kernels.MHAMaskVariant(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)

### `CAUSAL_ALIBI_MASK` {#max.nn.kernels.MHAMaskVariant.CAUSAL_ALIBI_MASK}

> CAUSAL\_ALIBI\_MASK *= '1'*

### `CAUSAL_MASK` {#max.nn.kernels.MHAMaskVariant.CAUSAL_MASK}

> CAUSAL\_MASK *= '0'*

### `CHUNKED_CAUSAL_MASK` {#max.nn.kernels.MHAMaskVariant.CHUNKED_CAUSAL_MASK}

> CHUNKED\_CAUSAL\_MASK *= '3'*

### `NULL_MASK` {#max.nn.kernels.MHAMaskVariant.NULL_MASK}

> NULL\_MASK *= '2'*

### `SLIDING_WINDOW_CAUSAL_MASK` {#max.nn.kernels.MHAMaskVariant.SLIDING_WINDOW_CAUSAL_MASK}

> SLIDING\_WINDOW\_CAUSAL\_MASK *= '4'*

## `PositionalEncodingVariant` {#max.nn.kernels.PositionalEncodingVariant}

> *class* max.nn.kernels.PositionalEncodingVariant(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)

### `ALIBI_POS` {#max.nn.kernels.PositionalEncodingVariant.ALIBI_POS}

> ALIBI\_POS *= 'alibi\_pos'*

### `NO_POS` {#max.nn.kernels.PositionalEncodingVariant.NO_POS}

> NO\_POS *= 'no\_pos'*

## `causal_flash_attention_gpu()` {#max.nn.kernels.causal_flash_attention_gpu}
> max.nn.kernels.causal\_flash\_attention\_gpu(q, k, v, scale)

Computes causal flash attention using a GPU-optimized kernel.

**Parameters:**

* **q** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Query tensor of shape \[batch, seq\_len, num\_heads, head\_dim].
* **k** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Key tensor of shape \[batch, seq\_len, num\_heads, head\_dim].
* **v** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Value tensor of shape \[batch, seq\_len, num\_heads, head\_dim].
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) ) – Scaling factor for attention scores.

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `cross_attention_ragged()` {#max.nn.kernels.cross_attention_ragged}

> max.nn.kernels.cross\_attention\_ragged(kv\_params, input, input\_row\_offsets, kv\_collection, layer\_idx, mask\_variant, kv\_input\_row\_offsets, q\_max\_seq\_len, scale, local\_window\_size=-1)

Computes cross attention provided the !mo.opaque KV Cache.

Notably, this materializes the attention mask (dependent on MHAMaskVariant) within the kernel.

input and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input. For cross attention, kv\_input\_row\_offsets represents the KV sequence lengths.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) `|` `PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **mask\_variant** ([`MHAMaskVariant`](#max.nn.kernels.MHAMaskVariant) )
* **kv\_input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **q\_max\_seq\_len** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )
* **local\_window\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `dynamic_scaled_matmul()` {#max.nn.kernels.dynamic_scaled_matmul}

> max.nn.kernels.dynamic\_scaled\_matmul(a, b, a\_scales, b\_scales, out\_type=bfloat16)

Perform a matmul of two tensors with scaling factors. Currently only supports channel-wise scaling for weights and per-token scaling for inputs.

**Parameters:**

* **a** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The first tensor to multiply.
* **b** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The second tensor to multiply, must be transposed.
* **a\_scales** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The scaling factors for the first tensor.
* **b\_scales** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The scaling factors for the second tensor.
* **out\_type** ([`DType`](../dtype.md#max.dtype.DType) )

**Returns:** The result of the matmul operation.
**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `flare_mla_decode_ragged()` {#max.nn.kernels.flare_mla_decode_ragged}

> max.nn.kernels.flare\_mla\_decode\_ragged(kv\_params, input, input\_row\_offsets, kv\_collection, layer\_idx, mask\_variant, scale, qk\_rope\_dim=64)

Computes flash (self) attention provided the !mo.opaque KV Cache.

Notably, this materializes the attention mask (dependent on MHAMaskVariant) within the kernel.

input and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input.

Note that this is self attention and the KV sequence length is assumed to be equal to the Q sequence length. For KV sequence length != Q sequence length, use cross\_attention\_ragged.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** (`PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **mask\_variant** ([`MHAMaskVariant`](#max.nn.kernels.MHAMaskVariant) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )
* **qk\_rope\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `flare_mla_decompress_k_cache()` {#max.nn.kernels.flare_mla_decompress_k_cache}

> max.nn.kernels.flare\_mla\_decompress\_k\_cache(kv\_params, buffer\_row\_offsets\_1d, cache\_offsets\_1d, buffer\_length, weight, kv\_collection, layer\_idx, buffer\_size)

This kernel decompresses the key cache by up-projecting latent representations into the KV space using a weight matrix. The process involves:

1. Copying buffer\_length latent vectors from the key cache into a contiguous buffer (k\_latent)
2. Computing k = k\_latent @ weight.T to obtain the decompressed keys

**Returns:** A tensor of shape \[buffer\_size, weight.shape\[0]] containing the decompressed keys. Note that only the first buffer\_length tokens are valid.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **buffer\_row\_offsets\_1d** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **cache\_offsets\_1d** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **buffer\_length** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **weight** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** (`PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **buffer\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `flare_mla_prefill_plan()` {#max.nn.kernels.flare_mla_prefill_plan}

> max.nn.kernels.flare\_mla\_prefill\_plan(kv\_params, input\_row\_offsets, kv\_collection, layer\_idx, buffer\_size, max\_chunks=16)

This kernel plans how to process a batch of sequences with varying lengths using a fixed-size buffer.

Each sequence in the batch has some existing cached tokens and new input tokens. The kernel divides the total tokens into chunks of buffer\_size. For each chunk (iteration), it calculates:
1. Buffer offsets for each sequence in each chunk
2. Cache offsets for each sequence in each chunk
3. Total buffer lengths for each processing iteration

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** (`PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **buffer\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **max\_chunks** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue), [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue), [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)]

## `flare_mla_prefill_ragged()` {#max.nn.kernels.flare_mla_prefill_ragged}

> max.nn.kernels.flare\_mla\_prefill\_ragged(kv\_params, input, k, v, input\_row\_offsets, buffer\_row\_offsets, cache\_offsets, kv\_collection, layer\_idx, mask\_variant, scale, qk\_rope\_dim=64, prev\_output=None, prev\_softmax\_info=None)

Performs MLA prefill. In the MLA prefill, we need to decompress the KV tensors, as we store the latent representations in the KV cache. We will decompress the KV tensors into a fixed size buffer to avoid out-of-memory errors. In case the total cache length is greater than the buffer size, we will process the attention calculation in chunks.

This MLA prefill kernel will return the output tensor for this iteration and the softmax info tensor for this iteration. Such tensors will be used by the next iteration of the MLA prefill kernel to continue the attention calculation.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) – KVCacheParams
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Input tensor
* **k** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Key tensor
* **v** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Value tensor
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Indicates where each batch starts and ends in input
* **buffer\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Indicates where each batch starts and ends in the buffer
* **cache\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Indicates where each batch starts and ends in the KV cache
* **kv\_collection** (`PagedKVCacheCollection` ) – KV collection
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Layer index tensor
* **mask\_variant** ([`MHAMaskVariant`](#max.nn.kernels.MHAMaskVariant) ) – Mask variant
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) ) – Scale
* **qk\_rope\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – QK rope dimension
* **prev\_output** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` `None` ) – Optional. Previous output tensor
* **prev\_softmax\_info** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` `None` ) – Optional. Previous softmax info tensor
**Returns:**

* The first tensor is the output tensor for this iteration
* The second tensor is the softmax info tensor for this iteration

**Return type:** A tuple of two tensors

## `flash_attention()` {#max.nn.kernels.flash_attention}

> max.nn.kernels.flash\_attention(kv\_params, input, kv\_collection, layer\_idx, attention\_mask, valid\_lengths, scale)

Computes flash attention provided the mo.opaque KV Cache.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **attention\_mask** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **valid\_lengths** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `flash_attention_ragged()` {#max.nn.kernels.flash_attention_ragged}

> max.nn.kernels.flash\_attention\_ragged(kv\_params, input, input\_row\_offsets, kv\_collection, layer\_idx, mask\_variant, scale, local\_window\_size=-1)

Computes flash (self) attention provided the !mo.opaque KV Cache.

Notably, this materializes the attention mask (dependent on MHAMaskVariant) within the kernel.

input and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input.

Note that this is self attention and the KV sequence length is assumed to be equal to the Q sequence length. For KV sequence length != Q sequence length, use cross\_attention\_ragged.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) `|` `PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **mask\_variant** ([`MHAMaskVariant`](#max.nn.kernels.MHAMaskVariant) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )
* **local\_window\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `flash_attention_with_causal_mask()` {#max.nn.kernels.flash_attention_with_causal_mask}

> max.nn.kernels.flash\_attention\_with\_causal\_mask(kv\_params, input, kv\_collection, layer\_idx, valid\_lengths, scale)

Computes flash attention provided the mo.opaque KV Cache. Notably, materializes the causal mask within the kernel.
**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **valid\_lengths** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `fused_qk_ragged_rope()` {#max.nn.kernels.fused_qk_ragged_rope}

> max.nn.kernels.fused\_qk\_ragged\_rope(kv\_params, input, input\_row\_offsets, kv\_collection, freqs\_cis, layer\_idx, interleaved=True)

Computes fused query-key attention with rotary positional encodings and ragged inputs.

input and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input.

**Parameters:**

* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – \[batch\_size \* seq\_len, n\_heads, head\_dim]
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **freqs\_cis** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – tensor of shape (max\_seq\_len \* 2, head\_dim)
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **interleaved** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )
* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) `|` `PagedKVCacheCollection` )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `fused_qk_rope()` {#max.nn.kernels.fused_qk_rope}

> max.nn.kernels.fused\_qk\_rope(kv\_params, input, kv\_collection, freqs\_cis\_2d, layer\_idx, interleaved=True)

Computes fused query-key attention with rotary positional encodings.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) )
* **freqs\_cis\_2d** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **interleaved** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `fused_qkv_matmul()` {#max.nn.kernels.fused_qkv_matmul}

> max.nn.kernels.fused\_qkv\_matmul(kv\_params, input, wqkv, kv\_collection, layer\_idx, n\_heads)

Computes fused query, key and value projections.
**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **wqkv** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `fused_qkv_ragged_matmul()` {#max.nn.kernels.fused_qkv_ragged_matmul}

> max.nn.kernels.fused\_qkv\_ragged\_matmul(kv\_params, input, input\_row\_offsets, wqkv, kv\_collection, layer\_idx, n\_heads, bias=None)

Computes fused query, key, and value projections with ragged input.

input and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input.

**Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **wqkv** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) `|` `PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **bias** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` `None` )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `fused_qkv_ragged_matmul_quantized()` {#max.nn.kernels.fused_qkv_ragged_matmul_quantized}

> max.nn.kernels.fused\_qkv\_ragged\_matmul\_quantized(kv\_params, input, input\_row\_offsets, wqkv, kv\_collection, layer\_idx, n\_heads, quantization\_config, perm\_idx=None, bias=None)

Computes fused query, key, and value projections with ragged input and quantized weight matrices. A quantization\_config must be provided.

input and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input.

**Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **wqkv** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) `|` `PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **quantization\_config** ([`QuantizationConfig`](../graph/quantization.md#max.graph.quantization.QuantizationConfig) )
* **perm\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` `None` )
* **bias** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` `None` )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `fused_qkv_ragged_matmul_scaled_float8()` {#max.nn.kernels.fused_qkv_ragged_matmul_scaled_float8}

> max.nn.kernels.fused\_qkv\_ragged\_matmul\_scaled\_float8(kv\_params, input, input\_row\_offsets, wqkv, kv\_collection, layer\_idx, n\_heads, input\_scale, weight\_scale, bias=None)

Computes fused query, key, and value projections with ragged input.

input and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input.

**Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **wqkv** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** (`PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **input\_scale** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **weight\_scale** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **bias** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` `None` )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `grouped_matmul_ragged()` {#max.nn.kernels.grouped_matmul_ragged}

> max.nn.kernels.grouped\_matmul\_ragged(hidden\_states, weight, expert\_start\_indices, expert\_ids, expert\_usage\_stats\_host)

Grouped matmul used in MoE layer.

hidden\_states and expert\_start\_indices are used together to implement the ragged tensor. expert\_start\_indices indicates where each group starts and ends in hidden\_states. expert\_ids is the id of the expert for each group in hidden\_states. expert\_usage\_stats\_host holds the maximum number of tokens assigned to any expert, and the number of active experts.
**Parameters:**

* **hidden\_states** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **weight** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **expert\_start\_indices** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **expert\_ids** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **expert\_usage\_stats\_host** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `kv_cache_get_max_seq_len()` {#max.nn.kernels.kv_cache_get_max_seq_len}

> max.nn.kernels.kv\_cache\_get\_max\_seq\_len(kv\_collection)

This kernel returns the maximum sequence length.

**Parameters:** **kv\_collection** (`PagedKVCacheCollection` )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `matmul_k_cache_ragged()` {#max.nn.kernels.matmul_k_cache_ragged}

> max.nn.kernels.matmul\_k\_cache\_ragged(kv\_params, hidden\_states, input\_row\_offsets, weight, kv\_collection, layer\_idx)

Computes key projections with ragged input.

hidden\_states and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **hidden\_states** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **weight** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** (`PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )

**Return type:** None

## `matmul_kv_cache_ragged()` {#max.nn.kernels.matmul_kv_cache_ragged}

> max.nn.kernels.matmul\_kv\_cache\_ragged(kv\_params, hidden\_states, input\_row\_offsets, weight, kv\_collection, layer\_idx)

Computes key and value projections with ragged input.

hidden\_states and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input.

**Parameters:**

* **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **hidden\_states** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **weight** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **kv\_collection** (`PagedKVCacheCollection` )
* **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )

**Return type:** None

## `matmul_static_scaled_float8()` {#max.nn.kernels.matmul_static_scaled_float8}

> max.nn.kernels.matmul\_static\_scaled\_float8(input, weight, input\_scale, weight\_scale)

**Parameters:**

* **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **weight** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **input\_scale** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )
* **weight\_scale** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) )

**Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)

## `merge_ragged_tensors()` {#max.nn.kernels.merge_ragged_tensors}

> max.nn.kernels.merge\_ragged\_tensors(a, a\_row\_offsets, b, b\_row\_offsets)

Merges two ragged tensors into a single ragged tensor.
Both ragged tensors must have the same batch size (same number of row offsets). This function interleaves the rows from each tensor based on their row offsets. **Parameters:** * **a** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The first ragged tensor of shape \[total\_a\_rows, …]. * **a\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The row offsets of the first ragged tensor, indicating where each batch starts and ends in a. * **b** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The second ragged tensor of shape \[total\_b\_rows, …]. * **b\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The row offsets of the second ragged tensor, indicating where each batch starts and ends in b. **Returns:** * The merged ragged tensor with shape \[total\_a\_rows + total\_b\_rows, …]. * The merged row offsets with the same shape as the input row offsets. **Return type:** A tuple of two tensors ## Example

```
a = [1, 2, 3, 4, 5, 6]
a_row_offsets = [0, 2, 6]
b = [7, 8, 9, 10]
b_row_offsets = [0, 3, 4]

merged_tensor, merged_row_offsets = merge_ragged_tensors(
    a, a_row_offsets, b, b_row_offsets
)

merged_tensor = [1, 2, 7, 8, 9, 3, 4, 5, 6, 10]
merged_row_offsets = [0, 5, 10]
```

## `moe_create_indices()` {#max.nn.kernels.moe_create_indices} > max.nn.kernels.moe\_create\_indices(topk\_ids, num\_local\_experts) Creates indices for the MoE layer. **Parameters:** * **topk\_ids** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The expert assignments for each token from the router. * **num\_local\_experts** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of experts on this device. **Returns:** * token\_expert\_order: The reordered token indices, grouped by assigned expert. * expert\_start\_indices: The starting index for each expert’s token group in the reordered sequence. * restore\_token\_order: The indices to restore original token ordering after expert computation. * expert\_ids: The IDs of the active experts selected for the tokens. * expert\_usage\_stats: The maximum number of tokens assigned to any expert, and the number of active experts. **Return type:** A tuple of five tensors ## `null_mask_flash_attention_gpu()` {#max.nn.kernels.null_mask_flash_attention_gpu} > max.nn.kernels.null\_mask\_flash\_attention\_gpu(q, k, v, scale) Computes flash attention using a GPU-optimized kernel. **Parameters:** * **q** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Query tensor of shape \[batch, seq\_len, num\_heads, head\_dim]. * **k** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Key tensor of shape \[batch, seq\_len, num\_heads, head\_dim]. * **v** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – Value tensor of shape \[batch, seq\_len, num\_heads, head\_dim]. * **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) ) – Scaling factor for attention scores. **Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue) ## `quantize_dynamic_scaled_float8()` {#max.nn.kernels.quantize_dynamic_scaled_float8} > max.nn.kernels.quantize\_dynamic\_scaled\_float8(input, scale\_ub=1200.0, group\_size\_or\_per\_token=-1, out\_type=float8\_e4m3fn, scales\_type=bfloat16) Dynamically quantizes the input tensor to fp8. **Parameters:** * **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) – The input tensor to quantize. 
* **scale\_ub** ([`float`](https://docs.python.org/3/library/functions.html#float) ) – The upper bound of the scale factor. * **group\_size\_or\_per\_token** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The group size for quantization. When set to -1, the quantization is column-wise. * **out\_type** ([`DType`](../dtype.md#max.dtype.DType) ) – The type of the output tensor. * **scales\_type** ([`DType`](../dtype.md#max.dtype.DType) ) – The type of the scales tensor. **Returns:** The quantized tensor and the scales. **Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue), [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue)] ## `quantize_static_scaled_float8()` {#max.nn.kernels.quantize_static_scaled_float8} > max.nn.kernels.quantize\_static\_scaled\_float8(x, scale, scale\_is\_inverted=True) **Parameters:** * **x** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **scale** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **scale\_is\_inverted** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue) ## `rms_norm_key_cache()` {#max.nn.kernels.rms_norm_key_cache} > max.nn.kernels.rms\_norm\_key\_cache(kv\_params, kv\_collection, gamma, epsilon, layer\_idx, total\_seq\_len, input\_row\_offsets, weight\_offset, rms\_norm\_cols=None) Computes RMSNorm on the *new* entries in the KVCache. This function applies RMSNorm to either all dimensions or a subset of dimensions in each head of the key cache. The size of the gamma tensor determines how many dimensions will be normalized. If gamma’s size doesn’t match head\_dim, rms\_norm\_cols must be explicitly specified to confirm the intention to normalize only a subset of dimensions. Currently, the KVCacheT class itself isn’t aware of new cache entries until the cache length is incremented, which happens after the model’s forward pass, so input\_row\_offsets is used to do this bookkeeping. 
**Parameters:** * **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) * **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) `|` `PagedKVCacheCollection` ) * **gamma** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **epsilon** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) ) * **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **total\_seq\_len** ([`Dim`](../graph/type.md#max.graph.type.Dim) ) * **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **weight\_offset** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) ) * **rms\_norm\_cols** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) **Return type:** None ## `swish_glu()` {#max.nn.kernels.swish_glu} > max.nn.kernels.swish\_glu(a, b0, b1) Computes `swish(a @ b0.t()) * (a @ b1.t())`. **Parameters:** * **a** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **b0** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **b1** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) **Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue) ## `unfused_qkv_ragged_matmul_gguf_quantized()` {#max.nn.kernels.unfused_qkv_ragged_matmul_gguf_quantized} > max.nn.kernels.unfused\_qkv\_ragged\_matmul\_gguf\_quantized(kv\_params, input, input\_row\_offsets, n\_heads, q\_weight, k\_weight, v\_weight, quantization\_encoding\_q, quantization\_encoding\_k, 
quantization\_encoding\_v, kv\_collection, layer\_idx) Computes fused query, key, and value projections with ragged input and quantized weight matrices. A quantization\_config must be provided. input and input\_row\_offsets are used together to implement the ragged tensor. input\_row\_offsets indicates where each batch starts and ends in input. **Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel. **Parameters:** * **kv\_params** ([`KVCacheParams`](kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) * **input** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **input\_row\_offsets** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **q\_weight** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **k\_weight** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **v\_weight** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) * **quantization\_encoding\_q** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) ) * **quantization\_encoding\_k** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) ) * **quantization\_encoding\_v** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) ) * **kv\_collection** ([`ContinuousBatchingKVCacheCollection`](kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheCollection) `|` `PagedKVCacheCollection` ) * **layer\_idx** ([`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) ) **Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue) --- ## KeyElement A trait composition for types that implement all requirements of dictionary keys. Dict keys must minimally be Copyable, Movable, Hashable, and EqualityComparable for a hash map. Until we have references, they must also be copyable. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Hashable`, `Movable`, `UnknownDestructibility` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `__eq__` `__eq__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are equal according to the type's definition of equality, False otherwise. ### `__ne__` `__ne__(self: _Self, other: _Self) -> Bool` Define whether two instances of the object are not equal to each other. **Args:** * ​other (`_Self`): Another instance of the same type. **Returns:** True if the instances are not equal according to the type's definition of equality, False otherwise. ### `__hash__` `__hash__(self: _Self) -> UInt` Return a 64-bit hash of the type's data. **Returns:** A 64-bit integer hash of this instance's data. 
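To make these requirements concrete, here is a minimal sketch of a user-defined type used as a dictionary key. The `Point` struct and its hash mixing are illustrative examples, not part of the standard library:

```mojo
from collections import Dict, KeyElement

struct Point(KeyElement):
    var x: Int
    var y: Int

    fn __init__(out self, x: Int, y: Int):
        self.x = x
        self.y = y

    fn __copyinit__(out self, existing: Self):
        self.x = existing.x
        self.y = existing.y

    fn __moveinit__(out self, owned existing: Self):
        self.x = existing.x
        self.y = existing.y

    fn __hash__(self) -> UInt:
        # Illustrative hash mix; any stable combination of the fields works.
        return UInt(self.x) * 31 + UInt(self.y)

    fn __eq__(self, other: Self) -> Bool:
        return self.x == other.x and self.y == other.y

    fn __ne__(self, other: Self) -> Bool:
        return not self == other

fn main() raises:
    var visits = Dict[Point, Int]()
    visits[Point(1, 2)] = 3
    print(visits[Point(1, 2)])  # => 3
```

Defining `__copyinit__`, `__moveinit__`, `__hash__`, `__eq__`, and `__ne__` covers exactly the methods listed for this trait composition above.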
--- ## KV cache KV (key-value) cache is a memory structure used in [transformer](transformer.mdx) models to store key-value tensors output from [self-attention](self-attention.mdx) layers. The KV cache speeds up inference for transformer models such as large language models (LLMs) by avoiding the need to recompute the self-attention scores for all previous tokens in a sequence. For example, suppose an LLM is trying to complete the sentence, "The quick brown fox..." After the model predicts "jumps" and then begins to predict the next token, the model must know the attention score for every token in the sequence so far (including the one it just predicted). That is, for each step in the [autoregression](autoregression.mdx) cycle, it must process the entire sequence thus far: 1. "The quick brown fox..." 2. "The quick brown fox jumps..." 3. "The quick brown fox jumps over..." And so on. By storing the already-calculated attention scores for previous tokens in the KV cache, the model simply reads the KV cache at each step, instead of recomputing those scores all over again. Once the model predicts the next token and calculates its self-attention, it adds those new values to the KV cache. As the sequence length grows during inference (as more words are generated), the KV cache becomes the dominant factor in a model's memory usage. The sequence length is always limited by the model's total context window length, which varies across models and can usually be configured. --- ## kv_cache ## Modules * [`cache_params`](/max/api/python/nn/kv_cache/cache_params) * [`continuous_batching_cache`](/max/api/python/nn/kv_cache/continuous_batching_cache) * [`hf`](/max/api/python/nn/kv_cache/hf) * [`manager`](/max/api/python/nn/kv_cache/manager) --- ## kv_cache Contains implementations for several types of key-value caches. [KV caches](/glossary/ai/kv-cache) are used in transformer models to store key-value tensors output from self-attention layers. These APIs are used in the higher-level functions in the [`nn`](/mojo/kernels/nn) package. ## Modules * [​`types`](./types/): This module contains the types for the key-value cache APIs. --- ## kv_cache ## Aliases ### `embed_fn_type` `alias embed_fn_type = fn[DType, Int](IndexList[4], SIMD[$0, $1]) capturing -> SIMD[$0, $1]` ## Functions * [​`generic_flash_attention_kv_cache_padded`](./generic_flash_attention_kv_cache_padded): * [​`generic_flash_attention_kv_cache_padded_materialized_mask`](./generic_flash_attention_kv_cache_padded_materialized_mask): * [​`generic_fused_qk_rope_bshd_continuous_batch`](./generic_fused_qk_rope_bshd_continuous_batch): Performs a fused RoPE projection for Q and K projections. * [​`generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch`](./generic_fused_qkv_matmul_kv_cache_bshd_continuous_batch): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. 
* [​`generic_get_continuous_cache`](./generic_get_continuous_cache): * [​`generic_get_paged_cache`](./generic_get_paged_cache): * [​`managed_tensor_slice_to_ndbuffer`](./managed_tensor_slice_to_ndbuffer): * [​`print_kv_cache_cont_batch_generic_cpu`](./print_kv_cache_cont_batch_generic_cpu): * [​`print_kv_cache_cont_batch_generic_gpu`](./print_kv_cache_cont_batch_generic_gpu): * [​`print_kv_cache_paged_generic_cpu`](./print_kv_cache_paged_generic_cpu): * [​`print_kv_cache_paged_generic_gpu`](./print_kv_cache_paged_generic_gpu): * [​`rms_norm_kv_cache_ragged_continuous_batching`](./rms_norm_kv_cache_ragged_continuous_batching): Performs RMSNorm in place on new entries in the key cache. * [​`rms_norm_kv_cache_ragged_paged`](./rms_norm_kv_cache_ragged_paged): Performs RMSNorm in place on new entries in the key cache. --- ## kv_cache_ragged ## Functions * [​`generic_cross_attention_kv_cache`](./generic_cross_attention_kv_cache): * [​`generic_flare_mla_decode_kv_cache_ragged`](./generic_flare_mla_decode_kv_cache_ragged): * [​`generic_flare_mla_decompress_k_cache_ragged_paged`](./generic_flare_mla_decompress_k_cache_ragged_paged): * [​`generic_flare_mla_prefill_kv_cache_ragged`](./generic_flare_mla_prefill_kv_cache_ragged): * [​`generic_flare_mla_prefill_ragged_paged_plan`](./generic_flare_mla_prefill_ragged_paged_plan): * [​`generic_flash_attention_kv_cache_ragged`](./generic_flash_attention_kv_cache_ragged): * [​`generic_fused_qk_rope_bshd_continuous_batch_ragged`](./generic_fused_qk_rope_bshd_continuous_batch_ragged): * [​`generic_fused_qk_rope_bshd_paged_ragged`](./generic_fused_qk_rope_bshd_paged_ragged): Performs a fused RoPE projection for Q and K projections. * [​`generic_fused_qkv_matmul_kv_cache_cont_batch_ragged`](./generic_fused_qkv_matmul_kv_cache_cont_batch_ragged): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged`](./generic_fused_qkv_matmul_kv_cache_paged_ragged): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged_bias`](./generic_fused_qkv_matmul_kv_cache_paged_ragged_bias): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`generic_fused_qkv_matmul_kv_cache_paged_ragged_scale`](./generic_fused_qkv_matmul_kv_cache_paged_ragged_scale): Performs a fused QKV matmul. Q outputs are written to the output argument while K and V outputs are written in-place into k\_cache and v\_cache. * [​`k_matmul_ragged_paged`](./k_matmul_ragged_paged): Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. * [​`kv_matmul_ragged_paged`](./kv_matmul_ragged_paged): Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. * [​`unfused_qkv_matmul_ragged_paged_gguf_quantized`](./unfused_qkv_matmul_ragged_paged_gguf_quantized): Performs a quantized matmul, writing the output into a mutable PagedKVCacheCollection object. 
* [​`valid_length_managed_tensor_slice_to_ndbuffer`](./valid_length_managed_tensor_slice_to_ndbuffer): --- ## kv_matmul_ragged_paged `kv_matmul_ragged_paged[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, target: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[type, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], weight: NDBuffer[type, 2, origin, shape], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], ctx: DeviceContextPtr)` Performs a matmul, writing the output into a mutable PagedKVCacheCollection object. **Args:** * ​hidden\_state (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​weight (`NDBuffer[type, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The historical KVCache for keys and values. The KVCache for this layer is retrieved via layer\_idx. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## KVCacheMHAOperand `@register_passable(trivial)` `struct KVCacheMHAOperand[cache_t: KVCacheT]` An implementation for `mo.opaque` KVCacheT arguments to MHA kernels. We can eventually remove this trait and just add it as a sub-trait in the KVCacheT type, but we need to solve some cyclic dependencies first. ## Fields * ​cache (`cache_t`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = get_vtable_entry(:trait cache_t, "type")` ## Methods ### `__init__` `__init__(cache: cache_t) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[get_vtable_entry(:trait cache_t, "type"), 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## KVCacheStaticParams `@register_passable(trivial)` `struct KVCacheStaticParams` ## Fields * ​num\_heads (`UInt`): * ​head\_size (`UInt`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` --- ## KVCacheT Trait for different KVCache types and implementations. Represents a single (key or value) cache. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `kv_params` `alias kv_params` ### `type` `alias type` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. 
**Args:** * ​existing (`_Self`): The value to move. ### `cache_lengths_nd` `cache_lengths_nd(self: _Self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` Returns the cache lengths as an NDBuffer. ### `cache_length` `cache_length(self: _Self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `load` `load[width: Int](self: _Self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[get_vtable_entry(:trait _Self, "type"), width]` Loads an element from the given index. ### `store` `store(self: _Self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[get_vtable_entry(:trait _Self, "type"), size])` Stores an element at the given index. ### `empty_cache` `empty_cache(self: _Self) -> Bool` Returns true if the cache\_lengths for all requests are 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self: _Self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "type"), 1]]` Returns a pointer to the KVCache block at the given index. Paged KVCache implementations must have a block\_size which is a multiple of the tile size and greater than the layout's first dimension. ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. --- ## KVCollectionT Trait for a pair of caches (keys and values). ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `CacheType` `alias CacheType` ### `kv_params` `alias kv_params` ### `name_str` `alias name_str` ### `type` `alias type` ## Methods ### `__copyinit__` `__copyinit__(out self: _Self, existing: _Self, /)` Create a new instance of the value by copying an existing one. **Args:** * ​existing (`_Self`): The value to copy. ### `__moveinit__` `__moveinit__(out self: _Self, owned existing: _Self, /)` Create a new instance of the value by moving the value of another. **Args:** * ​existing (`_Self`): The value to move. ### `get_key_cache` `get_key_cache(self: _Self, layer_idx: Int) -> get_vtable_entry(:trait _Self, "CacheType")` ### `get_value_cache` `get_value_cache(self: _Self, layer_idx: Int) -> get_vtable_entry(:trait _Self, "CacheType")` ### `cache_length` `cache_length(self: _Self, bs_idx: Int) -> Int` --- ## lane_group_max `lane_group_max[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces a SIMD value to its maximum within a lane group using warp-level operations. This function performs a parallel reduction across a group of lanes to find the maximum value. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. 
Each lane contributes its value to find the maximum. **Returns:** A SIMD value where all participating lanes contain the maximum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_max_and_broadcast `lane_group_max_and_broadcast[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces and broadcasts the maximum value within a lane group using warp-level operations. This function performs a parallel reduction to find the maximum value and broadcasts it to all lanes. The reduction and broadcast are done using warp shuffle operations in a butterfly pattern for efficient all-to-all communication between lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce and broadcast. Each lane contributes its value. **Returns:** A SIMD value where all participating lanes contain the maximum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_min `lane_group_min[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Reduces a SIMD value to its minimum within a lane group using warp-level operations. This function performs a parallel reduction across a group of lanes to find the minimum value. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the minimum. **Returns:** A SIMD value where all participating lanes contain the minimum value found across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_reduce `lane_group_reduce[val_type: DType, simd_width: Int, //, shuffle: fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1], func: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1], num_lanes: Int, *, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Performs a generic warp-level reduction operation using shuffle operations. This function implements a parallel reduction across threads in a warp using a butterfly pattern. It allows customizing both the shuffle operation and reduction function. Example:

```mojo
from gpu.warp import lane_group_reduce, shuffle_down

# Compute sum across 16 threads using shuffle down
@parameter
fn add[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) -> SIMD[type, width]:
    return x + y

var val = SIMD[DType.float32, 16](42.0)
var result = lane_group_reduce[shuffle_down, add, num_lanes=16](val)
```
**Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​shuffle (`fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1]`): A function that performs the warp shuffle operation. Takes a SIMD value and offset and returns the shuffled result. * ​func (`fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]`): A binary function that combines two SIMD values during reduction. This defines the reduction operation (e.g. add, max, min). * ​num\_lanes (`Int`): The number of lanes in a group. The reduction is done within each group. Must be a power of 2. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value. **Returns:** A SIMD value containing the reduction result. --- ## lane_group_sum `lane_group_sum[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum of values across a group of lanes using warp-level operations. This function performs a parallel reduction across a group of lanes to compute their sum. The reduction is done using warp shuffle operations for efficient communication between lanes. The result is stored in all participating lanes. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum. **Returns:** A SIMD value where all participating lanes contain the sum computed across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_group_sum_and_broadcast `lane_group_sum_and_broadcast[val_type: DType, simd_width: Int, //, num_lanes: Int, stride: Int = 1](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum across a lane group and broadcasts the result to all lanes. This function performs a parallel reduction using a butterfly pattern to compute the sum, then broadcasts the result to all participating lanes. The butterfly pattern ensures efficient communication between lanes through warp shuffle operations. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. * ​num\_lanes (`Int`): The number of threads participating in the reduction. * ​stride (`Int`): The stride between lanes participating in the reduction. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum. **Returns:** A SIMD value where all participating lanes contain the sum computed across the lane group. Non-participating lanes (lane\_id >= num\_lanes) retain their original values. --- ## lane_id `lane_id() -> UInt` Returns the lane ID of the current thread within its warp. The lane ID is a unique identifier for each thread within a warp, ranging from 0 to WARP\_SIZE-1. This ID is commonly used for warp-level programming and thread synchronization within a warp. **Returns:** The lane ID (0 to WARP\_SIZE-1) of the current thread. 
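The lane-group reductions above all share the same call shape. As a rough device-side sketch in the style of the `lane_group_reduce` example earlier in this section (the values and lane count here are illustrative):

```mojo
from gpu.warp import lane_group_sum

# Device-side fragment: each lane contributes a single float32 value.
# After the call, lanes 0-7 all hold the sum across the 8-lane group;
# lanes with lane_id >= 8 keep their original values.
var val = SIMD[DType.float32, 1](1.0)
var total = lane_group_sum[num_lanes=8](val)
```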
--- ## lane_id `lane_id() -> UInt` Returns the lane ID of the current thread. **Returns:** The lane ID of the current thread. --- ## launch_attribute GPU Launch Attributes for Kernel Execution Control This module provides structures for configuring GPU kernel execution through launch attributes. It implements a Mojo interface to CUDA's launch attribute system, allowing fine-grained control over kernel execution characteristics such as memory access policies, synchronization behavior, cluster dimensions, and resource allocation. The main components include: * `LaunchAttributeID`: Identifies different types of launch attributes * `LaunchAttributeValue`: Stores the value for a specific attribute type * `LaunchAttribute`: Combines an ID and value to form a complete attribute * `AccessPolicyWindow`: Configures memory access patterns and caching behavior * `AccessProperty`: Defines cache persistence properties for memory access These structures enable optimizing GPU kernel performance by controlling execution parameters at a granular level, similar to CUDA's native launch attribute system. ## Structs * [​`AccessPolicyWindow`](/mojo/stdlib/gpu/host/launch_attribute/AccessPolicyWindow): Specifies an access policy for a window of memory. * [​`AccessProperty`](/mojo/stdlib/gpu/host/launch_attribute/AccessProperty): Specifies performance hint with AccessPolicyWindow for hit\_prop and miss\_prop fields. * [​`LaunchAttribute`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttribute): Represents a complete launch attribute with ID and value. * [​`LaunchAttributeID`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttributeID): Identifies the type of launch attribute for GPU kernel execution. * [​`LaunchAttributeValue`](/mojo/stdlib/gpu/host/launch_attribute/LaunchAttributeValue): Represents a value for a CUDA launch attribute. --- ## launch_dependent_grids `launch_dependent_grids()` Launches dependent grids that were previously configured to depend on the current grid. This function triggers the execution of dependent grids that have been configured with a dependency on the current grid. It maps directly to the CUDA grid dependency control instruction for launching dependent grids. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs. * Must be called by all threads in a thread block to avoid undefined behavior. * Typically used in multi-grid pipeline scenarios where one grid's completion should trigger the execution of other grids. --- ## LaunchAttribute `@register_passable(trivial)` `struct LaunchAttribute` Represents a complete launch attribute with ID and value. This struct combines a `LaunchAttributeID` and `LaunchAttributeValue` to form a complete attribute that can be passed to GPU kernel launches. It provides a way to specify various execution parameters that control kernel behavior. ## Fields * ​id (`LaunchAttributeID`): The identifier specifying the type of this launch attribute. * ​\_\_pad (`StaticTuple[SIMD[uint8, 1], ((sizeof[::AnyType,__mlir_type.!kgen.target]() * -1) + 8)]`): Padding to ensure proper alignment of the structure. * ​value (`LaunchAttributeValue`): The value associated with this launch attribute. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initializes a new LaunchAttribute with IGNORE ID and zeroed value. `__init__(id: LaunchAttributeID, value: LaunchAttributeValue) -> Self` Initializes a `LaunchAttribute` with a specific ID and value. 
**Args:** * ​id (`LaunchAttributeID`): The `LaunchAttributeID` to set. * ​value (`LaunchAttributeValue`): The `LaunchAttributeValue` to set. `@implicit` `__init__(policy: AccessPolicyWindow) -> Self` Initializes a `LaunchAttribute` from an `AccessPolicyWindow`. Creates a launch attribute with `ACCESS_POLICY_WINDOW` ID and the provided policy. **Args:** * ​policy (`AccessPolicyWindow`): The `AccessPolicyWindow` to use for this attribute. ### `from_cluster_dim` `static from_cluster_dim(dim: Dim) -> Self` Creates a `LaunchAttribute` for cluster dimensions. Creates a launch attribute with `CLUSTER_DIMENSION` ID and the provided dimensions. **Args:** * ​dim (`Dim`): The dimensions to use for this attribute. **Returns:** A new `LaunchAttribute` configured with the specified cluster dimensions. --- ## LaunchAttributeID `@register_passable(trivial)` `struct LaunchAttributeID` Identifies the type of launch attribute for GPU kernel execution. This struct represents the various types of launch attributes that can be specified when launching GPU kernels or configuring streams and graph nodes. Each attribute controls different aspects of kernel execution behavior such as memory access policies, synchronization, scheduling, and resource allocation. The attributes are compatible with CUDA's launch attribute system and provide fine-grained control over kernel execution characteristics. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `ACCESS_POLICY_WINDOW` `alias ACCESS_POLICY_WINDOW = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](1))` Valid for streams, graph nodes, launches. ### `CLUSTER_DIMENSION` `alias CLUSTER_DIMENSION = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](4))` Valid for graph nodes, launches. ### `CLUSTER_SCHEDULING_POLICY_PREFERENCE` `alias CLUSTER_SCHEDULING_POLICY_PREFERENCE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](5))` Valid for graph nodes, launches. ### `COOPERATIVE` `alias COOPERATIVE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](2))` Valid for graph nodes, launches. ### `DEVICE_UPDATABLE_KERNEL_NODE` `alias DEVICE_UPDATABLE_KERNEL_NODE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](13))` Valid for graph nodes, launches. This attribute is graphs-only, and passing it to a launch in a non-capturing stream will result in an error. CUlaunchAttributeValue::deviceUpdatableKernelNode::deviceUpdatable can only be set to 0 or 1. Setting the field to 1 indicates that the corresponding kernel node should be device-updatable. On success, a handle will be returned via CUlaunchAttributeValue::deviceUpdatableKernelNode::devNode which can be passed to the various device-side update functions to update the node's kernel parameters from within another kernel. For more information on the types of device updates that can be made, as well as the relevant limitations thereof, see cudaGraphKernelNodeUpdatesApply. Nodes which are device-updatable have additional restrictions compared to regular kernel nodes. Firstly, device-updatable nodes cannot be removed from their graph via cuGraphDestroyNode. Additionally, once opted-in to this functionality, a node cannot opt out, and any attempt to set the deviceUpdatable attribute to 0 will result in an error. Device-updatable kernel nodes also cannot have their attributes copied to/from another kernel node via cuGraphKernelNodeCopyAttributes. 
Graphs containing one or more device-updatable nodes also do not allow multiple instantiation, and neither the graph nor its instantiated version can be passed to cuGraphExecUpdate. If a graph contains device-updatable nodes and updates those nodes from the device from within the graph, the graph must be uploaded with cuGraphUpload before it is launched. For such a graph, if host-side executable graph updates are made to the device-updatable nodes, the graph must be uploaded before it is launched again. ### `IGNORE` `alias IGNORE = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](0))` Ignored entry, for convenient composition. ### `LAUNCH_COMPLETION_EVENT` `alias LAUNCH_COMPLETION_EVENT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](12))` Valid for launches. Set CUlaunchAttributeValue::launchCompletionEvent to record the event. Nominally, the event is triggered once all blocks of the kernel have begun execution. Currently this is a best effort. If a kernel B has a launch completion dependency on a kernel A, B may wait until A is complete. Alternatively, blocks of B may begin before all blocks of A have begun, for example if B can claim execution resources unavailable to A (e.g. they run on different GPUs) or if B is a higher priority than A. Exercise caution if such an ordering inversion could lead to deadlock. A launch completion event is nominally similar to a programmatic event with triggerAtBlockStart set except that it is not visible to cudaGridDependencySynchronize() and can be used with compute capability less than 9.0. The event supplied must not be an interprocess or interop event. The event must disable timing (i.e. must be created with the CU\_EVENT\_DISABLE\_TIMING flag set). ### `MEM_SYNC_DOMAIN` `alias MEM_SYNC_DOMAIN = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](10))` Valid for streams, graph nodes, launches. ### `MEM_SYNC_DOMAIN_MAP` `alias MEM_SYNC_DOMAIN_MAP = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](9))` Valid for streams, graph nodes, launches. ### `PREFERRED_SHARED_MEMORY_CARVEOUT` `alias PREFERRED_SHARED_MEMORY_CARVEOUT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](14))` Valid for launches. On devices where the L1 cache and shared memory use the same hardware resources, setting CUlaunchAttributeValue::sharedMemCarveout to a percentage between 0-100 signals the CUDA driver to set the shared memory carveout preference, in percent of the total shared memory for that kernel launch. This attribute takes precedence over CU\_FUNC\_ATTRIBUTE\_PREFERRED\_SHARED\_MEMORY\_CARVEOUT. This is only a hint, and the CUDA driver can choose a different configuration if required for the launch. ### `PRIORITY` `alias PRIORITY = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](8))` Valid for streams, graph nodes, launches. ### `PROGRAMMATIC_EVENT` `alias PROGRAMMATIC_EVENT = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](7))` Valid for launches. Set CUlaunchAttributeValue::programmaticEvent to record the event. Event recorded through this launch attribute is guaranteed to only trigger after all block in the associated kernel trigger the event. A block can trigger the event through PTX launchdep.release or CUDA builtin function cudaTriggerProgrammaticLaunchCompletion(). A trigger can also be inserted at the beginning of each block's execution if triggerAtBlockStart is set to non-0. 
The dependent launches can choose to wait on the dependency using the programmatic sync (cudaGridDependencySynchronize() or equivalent PTX instructions). Note that dependents (including the CPU thread calling cuEventSynchronize()) are not guaranteed to observe the release precisely when it is released. For example, cuEventSynchronize() may only observe the event trigger long after the associated kernel has completed. This recording type is primarily meant for establishing programmatic dependency between device tasks. Note also this type of dependency allows, but does not guarantee, concurrent execution of tasks. The event supplied must not be an interprocess or interop event. The event must disable timing (i.e. must be created with the CU\_EVENT\_DISABLE\_TIMING flag set). ### `PROGRAMMATIC_STREAM_SERIALIZATION` `alias PROGRAMMATIC_STREAM_SERIALIZATION = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](6))` Valid for launches. Setting CUlaunchAttributeValue::programmaticStreamSerializationAllowed to non-0 signals that the kernel will use programmatic means to resolve its stream dependency, so that the CUDA runtime should opportunistically allow the grid's execution to overlap with the previous kernel in the stream, if that kernel requests the overlap. The dependent launches can choose to wait on the dependency using the programmatic sync. ### `SYNCHRONIZATION_POLICY` `alias SYNCHRONIZATION_POLICY = LaunchAttributeID(__init__[__mlir_type.!pop.int_literal](3))` Valid for streams. ## Methods ### `__init__` `__init__(*, other: Self) -> Self` Explicitly construct a deep copy of the provided value. **Args:** * ​other (`Self`): The value to copy. ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `LaunchAttribute` instances are equal. Compares the underlying integer values of the attributes. **Args:** * ​other (`Self`): The other `LaunchAttribute` instance to compare with. **Returns:** True if the attributes are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `LaunchAttribute` instances are not equal. **Args:** * ​other (`Self`): The other `LaunchAttribute` instance to compare with. **Returns:** True if the attributes are not equal, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `LaunchAttribute` instances have the same value. This is an identity comparison that delegates to equality comparison. **Args:** * ​other (`Self`): The other `LaunchAttribute` instance to compare with. **Returns:** True if the attributes have the same value, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `LaunchAttribute` instances have different values. **Args:** * ​other (`Self`): The other `LaunchAttribute` instance to compare with. **Returns:** True if the attributes have different values, False otherwise. ### `__str__` `__str__(self) -> String` Returns a string representation of the `LaunchAttribute`. **Returns:** A string representation of the attribute. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the string representation of the attribute to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer interface. **Args:** * ​writer (`W`): The writer to write to. --- ## LaunchAttributeValue `@register_passable(trivial)` `struct LaunchAttributeValue` Represents a value for a CUDA launch attribute. This struct emulates a C union to store different types of launch attribute values. 
It provides a fixed-size storage that can be initialized with different attribute types such as AccessPolicyWindow or dimension specifications. Note: This implementation uses a fixed-size byte array to emulate the union behavior defined in the CUDA Driver API's CUlaunchAttributeValue. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__() -> Self` Initializes a new `LaunchAttributeValue` with zeroed storage. `@implicit` `__init__(policy: AccessPolicyWindow) -> Self` Initializes a `LaunchAttributeValue` from an `AccessPolicyWindow`. **Args:** * ​policy (`AccessPolicyWindow`): The `AccessPolicyWindow` to store in this attribute value. `@implicit` `__init__(dim: Dim) -> Self` Initializes a LaunchAttributeValue from a Dim (dimension) object. **Args:** * ​dim (`Dim`): The dimension specification to store in this attribute value. `@implicit` `__init__(value: Bool) -> Self` Initializes a LaunchAttributeValue from a boolean object. **Args:** * ​value (`Bool`): The boolean value to store in this attribute value. --- ## layer ## `Layer` {#max.nn.layer.Layer} > *class* max.nn.layer.Layer #### Deprecated Deprecated since version 25.2. Base class for neural network components. Use [`Module`](#max.nn.layer.Module) instead. Provides functionality for adding hooks to the call function of each layer to support testing, debugging or profiling. ## `LayerList` {#max.nn.layer.LayerList} > *class* max.nn.layer.LayerList(layers) Stores a list of layers. Can be used as a regular Python list. **Parameters:** **layers** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`Layer`](#max.nn.layer.Layer) `]` ) ### `append()` {#max.nn.layer.LayerList.append} > append(layer) **Parameters:** **layer** ([`Layer`](#max.nn.layer.Layer) ) ### `extend()` {#max.nn.layer.LayerList.extend} > extend(layer) **Parameters:** **layer** ([`Layer`](#max.nn.layer.Layer) ) ### `insert()` {#max.nn.layer.LayerList.insert} > insert(i, layer) **Parameters:** * **i** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The index at which to insert the layer. * **layer** ([`Layer`](#max.nn.layer.Layer) ) – The layer to insert. ### `sublayers` {#max.nn.layer.LayerList.sublayers} > *property* sublayers\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Module](#max.nn.layer.Module)]\* ## `Module` {#max.nn.layer.Module} > *class* max.nn.layer.Module Base class for model components with weight management. Provides functionality to create custom layers and construct networks with automatic weight tracking. The following example uses the [`Module`](#max.nn.layer.Module) class to create custom layers and build a neural network:

```python
from max import nn
from max.dtype import DType
from max.graph import Weight, ops, DeviceRef

class Linear(nn.Module):
    def __init__(self, in_dims, out_dims):
        super().__init__()
        self.weight = Weight("weight", DType.float32, (in_dims, out_dims), DeviceRef.CPU())

    def __call__(self, x):
        return x @ self.weight.T

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = Linear(5, 10)
        self.gate = Linear(5, 10)
        self.down = Linear(10, 5)

    def __call__(self, x):
        return self.down(ops.silu(self.gate(x)) + self.up(x))

model = MLP()
print(model.state_dict())  # {"up.weight": Tensor([5, 10]), ...}
```

Constructing a graph without [`Module`](#max.nn.layer.Module) can result in name collisions with the weights (in this example, there would be three weights with the name Weight). 
With [`Module`](#max.nn.layer.Module), you can use [`state_dict()`](#max.nn.layer.Module.state_dict) or [`load_state_dict()`](#max.nn.layer.Module.load_state_dict) to initialize or set the weights values, and finalize the weight names to be unique within the model. ### `layer_weights` {#max.nn.layer.Module.layer_weights} > *property* layer\_weights\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Weight](../graph/Weight.md#max.graph.Weight)]\* ### `load_state_dict()` {#max.nn.layer.Module.load_state_dict} > load\_state\_dict(state\_dict, \*, override\_quantization\_encoding=False, weight\_alignment=None, strict=True) Sets the values of all weights in this model. **Parameters:** * **state\_dict** ([`Mapping`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Mapping) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`DLPackArray`](../driver.md#max.driver.DLPackArray) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `WeightData` `]` ) – A map from weight name to a numpy array or [`max.driver.Tensor`](../driver.md#max.driver.Tensor). * **override\_quantization\_encoding** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether to override the weight quantization based on the loaded value. * **weight\_alignment** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – If specified, overrides the alignment for each weight in the Module. If left as None, each value in state\_dict must be aligned by the default dtype alignment. * **strict** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – If True, raises an error if any keys in state\_dict were not used by the Module. **Raises:** * [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If any weight in the model is not present in the state dict. * [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If strict is True and state\_dict contains keys not used by the Module. **Return type:** None ### `raw_state_dict()` {#max.nn.layer.Module.raw_state_dict} > raw\_state\_dict() Returns all weights objects in the model. Unlike [`state_dict`](#max.nn.layer.Module.state_dict), this returns [`max.graph.Weight`](../graph/Weight.md#max.graph.Weight) objects instead of the assigned values. Some parameters inside the `Weight` can be configured before a graph is built. Do not change these attributes after building a graph: * [`align`](../graph/Weight.md#max.graph.Weight.align) * [`dtype`](../graph/Weight.md#max.graph.Weight.dtype) * [`quantization_encoding`](../graph/Weight.md#max.graph.Weight.quantization_encoding) * [`shape`](../graph/Weight.md#max.graph.Weight.shape) **Returns:** Map from weight name to the [`max.graph.Weight`](../graph/Weight.md#max.graph.Weight) object. **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [*Weight*](../graph/Weight.md#max.graph.Weight)] ### `set_shared_weight()` {#max.nn.layer.Module.set_shared_weight} > set\_shared\_weight(name, weight) **Parameters:** * **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **weight** ([`Weight`](../graph/Weight.md#max.graph.Weight) ) ### `state_dict()` {#max.nn.layer.Module.state_dict} > state\_dict(auto\_initialize=True) Returns values of all weights in the model. 
The values returned are the same as the values set in [`load_state_dict`](#max.nn.layer.Module.load_state_dict). If [`load_state_dict`](#max.nn.layer.Module.load_state_dict) has not been called and none of the weights have values, then they are initialized to zero. **Parameters:** **auto\_initialize** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Determines whether to initialize weights to zero if the weight value has not been loaded. If this is False, a ValueError is raised if an uninitialized weight is found. **Returns:** Map from weight name to the weight value (can be numpy array or [`max.driver.Tensor`](../driver.md#max.driver.Tensor)). **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [*DLPackArray*](../driver.md#max.driver.DLPackArray) | [*ndarray*](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)] ### `sublayers` {#max.nn.layer.Module.sublayers} > *property* sublayers\*: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Module](#max.nn.layer.Module)]\* ## `add_layer_hook()` {#max.nn.layer.add_layer_hook} > max.nn.layer.add\_layer\_hook(fn) Adds a hook to call a function after each layer’s `__call__`. The function will be passed four inputs: * layer * input\_args * input\_kwargs * outputs The function can either return None or new outputs that will replace the layer returned outputs. Note that the inputs and outputs contain graph Values, which show limited information (like [`shape`](../graph/TensorValue.md#max.graph.TensorValue.shape) and [`dtype`](../graph/TensorValue.md#max.graph.TensorValue.dtype)). You can still see the computed values if you include the Value in the `graph.ops.output` op, or call `graph.ops.print`. Example of printing debug inputs:

```python
def print_info(layer, args, kwargs, outputs):
    print("Layer:", type(layer).__name__)
    print("Input args:", args)
    print("Input kwargs:", kwargs)
    print("Outputs:", outputs)
    return outputs

add_layer_hook(print_info)
```

**Parameters:** **fn** ([`Callable`](https://docs.python.org/3/library/typing.html#typing.Callable) `[` `[` [`Layer`](#max.nn.layer.Layer) `,` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `,` `...` `]` `,` [`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `]` `,` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `]` `,` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `]` ) **Return type:** None ## `clear_hooks()` {#max.nn.layer.clear_hooks} > max.nn.layer.clear\_hooks() Remove all hooks. ## `recursive_named_layers()` {#max.nn.layer.recursive_named_layers} > max.nn.layer.recursive\_named\_layers(parent, prefix='') Recursively walks through the layers and generates names. **Parameters:** * **parent** ([`Module`](#max.nn.layer.Module) ) * **prefix** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) **Return type:** [*Iterable*](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [*Module*](#max.nn.layer.Module)]] --- ## layer_norm Layer Normalization layer. 
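As a point of reference for the entries below, layer normalization normalizes across the last dimension and then applies a scale and shift. A minimal Mojo sketch of the math (scalar gamma and beta for brevity; the function name is illustrative, not the MAX API):

```mojo
from math import sqrt

fn layer_norm_reference(x: List[Float64], gamma: Float64, beta: Float64, eps: Float64) -> List[Float64]:
    # Mean over the normalized dimension.
    var n = Float64(len(x))
    var mean: Float64 = 0.0
    for i in range(len(x)):
        mean += x[i]
    mean /= n

    # Variance over the normalized dimension.
    var variance: Float64 = 0.0
    for i in range(len(x)):
        variance += (x[i] - mean) * (x[i] - mean)
    variance /= n

    # Normalize, then scale by gamma and shift by beta.
    var out = List[Float64]()
    for i in range(len(x)):
        out.append((x[i] - mean) / sqrt(variance + eps) * gamma + beta)
    return out

fn main():
    var y = layer_norm_reference(List[Float64](1.0, 2.0, 3.0, 4.0), 1.0, 0.0, 1e-5)
    for i in range(len(y)):
        print(y[i])
```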
## `LayerNorm` {#max.nn.norm.layer_norm.LayerNorm}

> *class* max.nn.norm.layer\_norm.LayerNorm(dims, device, dtype, eps=1e-05, use\_bias=True)

Layer normalization block.

**Parameters:**

* **dims** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **device** (`DeviceRef` )
* **dtype** ([`DType`](../../dtype.md#max.dtype.DType) )
* **eps** ([`float`](https://docs.python.org/3/library/functions.html#float) )
* **use\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )

## `LayerNormV1` {#max.nn.norm.layer_norm.LayerNormV1}

> *class* max.nn.norm.layer\_norm.LayerNormV1(weight, bias=None, eps=1e-06)

Layer normalization block.

Deprecated: Use LayerNorm instead.

**Parameters:**

* **weight** ([`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) )
* **bias** ([`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` `None` )
* **eps** ([`float`](https://docs.python.org/3/library/functions.html#float) )

### `bias` {#max.nn.norm.layer_norm.LayerNormV1.bias}

> bias\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

### `eps` {#max.nn.norm.layer_norm.LayerNormV1.eps}

> eps\*: [float](https://docs.python.org/3/library/functions.html#float)\* *= 1e-06*

### `weight` {#max.nn.norm.layer_norm.LayerNormV1.weight}

> weight\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)\*

---

## layer_norm

`layer_norm[type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], input_1_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], /, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](shape: IndexList[rank], gamma_shape: IndexList[1], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides], ctx: DeviceContextPtr)`

---

## layer_norm_cpu

`layer_norm_cpu[type: DType, //, input_fn: fn[Int](Int, Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](out_buf: NDBuffer[type, 2, origin, shape], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1])`

Computes layernorm(elementwise\_fn(x)) across the last dimension of x, where layernorm is defined as $(x - \mathrm{mean}(x)) / \sqrt{\mathrm{var}(x) + \epsilon} \cdot \gamma + \beta$ (with $\gamma$ generated by `gamma_fn`).

Currently performs 3 passes over the input data. This can be reduced to 2 by fusing the add, mean, and variance loops using Welford's algorithm.

**Parameters:**

* ​type (`DType`): The element dtype of the x and out buffers.
* ​input\_fn (`fn[Int](Int, Int) capturing -> SIMD[type, $0]`): Function called to generate an input value.
* ​gamma\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): Function called to generate a gamma value.

**Args:**

* ​out\_buf (`NDBuffer[type, 2, origin, shape]`): The output buffer.
* ​beta (`NDBuffer[type, 1, origin]`): The beta value to use in the layernorm calculation.
* ​epsilon (`SIMD[type, 1]`): The eps value to use in the layernorm calculation.
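To make the three passes concrete, here is a minimal scalar sketch of the computation above. This is a hypothetical reference helper, not the library kernel: the real kernels are vectorized and pull their input through `input_fn` rather than a `List`.

```mojo
from math import sqrt

# Hypothetical scalar reference for the layernorm formula above.
fn layer_norm_reference(
    x: List[Float64],
    gamma: List[Float64],
    beta: List[Float64],
    eps: Float64,
) -> List[Float64]:
    var n = Float64(len(x))

    # Pass 1: mean(x)
    var mean = 0.0
    for i in range(len(x)):
        mean += x[i]
    mean /= n

    # Pass 2: var(x)
    var variance = 0.0
    for i in range(len(x)):
        variance += (x[i] - mean) * (x[i] - mean)
    variance /= n

    # Pass 3: (x - mean(x)) / sqrt(var(x) + eps) * gamma + beta
    var out = List[Float64]()
    for i in range(len(x)):
        out.append((x[i] - mean) / sqrt(variance + eps) * gamma[i] + beta[i])
    return out
```

Fusing passes 1 and 2 with Welford's running mean/variance update is what reduces this to two passes, as noted above.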
`layer_norm_cpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank, element_type=element_type], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides])`

---

## layer_norm_gpu

`layer_norm_gpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank, element_type=element_type], beta: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], output: NDBuffer[type, rank, origin, shape, strides], *, ctx: DeviceContext)`

---

## layer_norm_gpu_block

`layer_norm_gpu_block[type: DType, //, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], beta: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1])`

---

## layer_norm_gpu_warp_tiling

`layer_norm_gpu_warp_tiling[type: DType, //, simd_width: UInt, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], gamma_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](output: NDBuffer[type, 2, MutableAnyOrigin], beta: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1])`

---

## layer_norm_reshape

`layer_norm_reshape[type: DType, rank: Int, //, output_rank: Int](shape: IndexList[rank, element_type=element_type], buf: NDBuffer[type, rank, origin, shape, strides]) -> NDBuffer[type, output_rank, origin]`

---

## layer_norm_shape

`layer_norm_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], gamma: NDBuffer[type, 1, origin, __init__[::Intable](1)], beta: NDBuffer[type, 1, origin, __init__[::Intable](1)], epsilon: SIMD[type, 1]) -> IndexList[rank]`

Compute the output shape of a `layer_norm` operation.

**Parameters:**

* ​type (`DType`): Type of the input tensors.
* ​rank (`Int`): Rank of the input tensor.
* ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.

**Args:**

* ​input (`NDBuffer[type, rank, origin]`): The input tensor.
* ​gamma (`NDBuffer[type, 1, origin, __init__[::Intable](1)]`): The tensor for the gamma coefficient.
* ​beta (`NDBuffer[type, 1, origin, __init__[::Intable](1)]`): The tensor for the beta coefficient.
* ​epsilon (`SIMD[type, 1]`): The epsilon coefficient.

**Returns:** The output shape.

---

## layout

Provides layout and layout tensor types, which abstract memory layout for multidimensional data.

* The [`Layout`](/mojo/kernels/layout/layout/Layout) type represents a mapping between a set of logical coordinates and a linear index. It can be used, for example, to map logical tensor coordinates to a memory address, or to map GPU threads to tiles of data.
* The [`LayoutTensor`](/mojo/kernels/layout/layout_tensor/LayoutTensor) type is a high-performance tensor with explicit memory layout via a `Layout`.

## Modules

* [​`element`](./element/): Provides element-based access to memory using layout-driven vectorization.
* [​`int_tuple`](./int_tuple/): Hierarchical integer tuple data structures for high-performance tensor operations.
* [​`layout`](./layout/): Provides a high-performance tensor layout system for memory mapping and indexing.
* [​`layout_tensor`](./layout_tensor/): Provides the `LayoutTensor` type for representing multidimensional data.
* [​`math`](./math/): Implements math methods that work on layout tensors.
* [​`runtime_layout`](./runtime_layout/): Provides the `RuntimeLayout` type and functions for working with it. You can use `RuntimeLayout` to define a layout where the dimensions are not known at compile time.
* [​`runtime_tuple`](./runtime_tuple/): Provides the `RuntimeTuple` data structure and related utility functions for handling tuple-like data with both compile-time and runtime elements. `RuntimeTuple` is designed for high-performance tensor operations, supporting efficient manipulation of multi-dimensional data structures like shapes, indices, and coordinates.
* [​`swizzle`](./swizzle/): Defines swizzle layouts for optimizing memory access patterns.
* [​`tensor_builder`](./tensor_builder/): Tensor Builder Module
* [​`tensor_core`](./tensor_core/): Tensor Core Module for High-Performance Matrix Operations
* [​`tensor_core_async`](./tensor_core_async/): Tensor Core Async Module
* [​`tma_async`](./tma_async/): Tensor Memory Accelerator (TMA) Asynchronous Operations Module

---

## layout

Provides a high-performance tensor layout system for memory mapping and indexing.

The layout module implements a comprehensive system for describing memory layouts of multi-dimensional tensors, enabling efficient mapping between logical tensor coordinates and physical memory locations. This is a critical component for high-performance tensor operations in machine learning and scientific computing.

These low-level primitives require careful use to avoid errors. Understanding the relationship between tensor shapes, strides, and memory layout is essential for effective use.

Key components:

* `LayoutTrait`: Core trait defining the interface for all layout types
* `Layout`: Primary struct implementing memory layout with shape and stride information
* Layout algebra: Functions for composing, dividing, and transforming layouts
* Tiling operations: Functions for hierarchical decomposition of layouts

Performance features:

* Zero-cost abstractions for mapping between logical and physical indices
* Support for both compile-time and runtime-determined shapes
* Efficient memory access patterns through layout transformations
* Hierarchical tiling for cache-friendly memory access

Common use cases:

* Defining memory layouts for tensors with different storage formats (row-major, column-major)
* Implementing efficient tensor operations with optimal memory access patterns
* Supporting hardware-specific memory layouts for accelerators
* Enabling zero-copy tensor views and reshaping operations

Example:

```mojo
from layout import Layout
from layout.int_tuple import IntTuple
from layout.layout import blocked_product

# Create a 3x4 row-major layout
var layout = Layout.row_major(3, 4)

# Access the memory location for logical coordinates (1, 2)
var memory_idx = layout(IntTuple(1, 2))

# Create a tiled layout for blocked matrix multiplication
var tiled = blocked_product(layout, Layout(IntTuple(2, 2)))
```

## Aliases

### `LayoutList`

`alias LayoutList = List[Layout]`

## Structs

* [​`Layout`](./Layout): Represents a memory layout for multi-dimensional data.

## Traits

* [​`LayoutTrait`](./LayoutTrait): Defines the interface for mapping between logical coordinates and memory indices.

## Functions

* [​`apply_tiler`](./apply_tiler): Applies a layout transformation function to each element of a layout with a tiler.
* [​`blocked_product`](./blocked_product): Creates a blocked layout by combining two layouts.
* [​`coalesce`](./coalesce): Simplifies a layout by combining dimensions with contiguous strides.
* [​`complement`](./complement): Computes the complement layout for a given layout. * [​`composition`](./composition): Composes two layouts to create a new layout. * [​`cosize`](./cosize): Returns the size of the memory region spanned by the layout. * [​`downcast`](./downcast): Splits elements in a layout to create a finer layout without changing the total number of elements so that the alignment is preserved. * [​`expand_modes_alike`](./expand_modes_alike): Aligns two shape-stride pairs to have the same hierarchical structure. * [​`expand_strides`](./expand_strides): Expands a scalar stride into a stride tuple matching a shape tuple. * [​`format_layout`](./format_layout): Formats a 2D layout as a table and writes it to the specified writer. * [​`hierarchical_unzip`](./hierarchical_unzip): Hierarchically unzips a layout according to a list of layouts. * [​`is_contiguous_dim`](./is_contiguous_dim): Checks if a flat layout is contiguous in a specific dimension. * [​`is_row_major`](./is_row_major): Checks if a layout has row-major ordering for the specified rank. * [​`logical_divide`](./logical_divide): Divides a layout into blocks according to another layout. * [​`logical_product`](./logical_product): Creates a product of two layouts. * [​`make_layout`](./make_layout): Creates a composite layout by concatenating multiple layouts. * [​`make_ordered_layout`](./make_ordered_layout): Creates a layout with strides ordered according to a specified traversal order. * [​`MakeLayoutList`](./MakeLayoutList): Creates a list containing two layouts. * [​`MakeTileLayoutList`](./MakeTileLayoutList): Creates a list of layouts for tiling operations. * [​`print_layout`](./print_layout): Prints a 2D layout to the standard output. * [​`right_inverse`](./right_inverse): Creates a right inverse of a layout. * [​`size`](./size): Returns the total number of elements in the layout's domain. * [​`sublayout`](./sublayout): Creates a sublayout by selecting specific dimensions from a layout. * [​`tile_to_shape`](./tile_to_shape): Creates a layout by tiling a base layout to match a target shape. * [​`upcast`](./upcast): Fuses consecutive elements in a layout to create a coarser layout. * [​`zip_modes`](./zip_modes): Combines corresponding modes from two layouts. * [​`zipped_divide`](./zipped_divide): Divides a layout into blocks according to another layout. --- ## Layout `struct Layout` Represents a memory layout for multi-dimensional data. The Layout struct is the primary implementation of the LayoutTrait, providing a concrete representation of memory layouts using shape and stride information. It maps between logical coordinates and linear memory indices, enabling efficient access to multi-dimensional data. A Layout consists of: * shape: Defines the dimensions of the logical coordinate space * stride: Defines the step sizes in memory for each dimension The Layout struct supports various operations including: * Creation of row-major and column-major layouts * Conversion between coordinates and indices * Composition with other layouts * Iteration over sub-layouts Layouts can be hierarchical, with nested shapes and strides, allowing for complex memory access patterns like blocked or tiled layouts. ## Fields * ​shape (`IntTuple`): The dimensions of the layout. This field defines the size of each dimension in the logical coordinate space. For example, a shape of (3, 4) represents a 3×4 grid of elements. * ​stride (`IntTuple`): The memory step sizes for each dimension. 
This field defines how many elements to skip in memory when moving one unit in each dimension. For example, in a row-major 3×4 layout, the strides might be (4, 1), meaning moving one unit in the first dimension requires skipping 4 elements in memory, while moving one unit in the second dimension requires skipping 1 element.

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `LayoutTrait`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `has_shape`

`alias has_shape = True`

Indicates whether the layout has a valid shape.

## Methods

### `__init__`

`__init__(out self)`

Initializes an empty layout with no dimensions. Creates a layout with empty shape and stride tuples, which can be populated later using append operations.

`@implicit`
`__init__(out self, shape: IntTuple[origin])`

Initializes a layout with the given shape and column-major strides. Creates a layout with the specified shape and automatically calculates column-major strides (where the first dimension varies fastest in memory).

**Args:**

* ​shape (`IntTuple[origin]`): The dimensions of the layout.

`__init__(out self, shape: IntTuple[origin], stride: IntTuple[origin])`

Initializes a layout with the given shape and stride. Creates a layout with explicitly specified shape and stride values. If an empty stride is provided, column-major strides are calculated.

**Args:**

* ​shape (`IntTuple[origin]`): The dimensions of the layout.
* ​stride (`IntTuple[origin]`): The memory step size for each dimension, or empty for column-major.

`__init__(out self, *, other: Self)`

Explicitly constructs a deep copy of the provided layout.

**Args:**

* ​other (`Self`): The layout to copy.

### `__getitem__`

`__getitem__(self, index: Int) -> Self`

Returns a sub-layout for the specified dimension.

**Args:**

* ​index (`Int`): The dimension index to extract.

**Returns:** A Layout containing the shape and stride for the specified dimension.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Checks if this layout is equal to another layout. Two layouts are considered equal if they have identical shape and stride tuples.

**Args:**

* ​other (`Self`): The layout to compare with.

**Returns:** True if the layouts are equal, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Checks if this layout is not equal to another layout.

**Args:**

* ​other (`Self`): The layout to compare with.

**Returns:** True if the layouts are not equal, False otherwise.

### `idx2crd`

`idx2crd(self, idx: IntTuple[origin]) -> IntTuple`

Converts a linear index to logical coordinates. This is the inverse operation of the `__call__` method, mapping from a memory index back to the corresponding logical coordinates.

**Args:**

* ​idx (`IntTuple[origin]`): The linear index to convert.

**Returns:** The logical coordinates corresponding to the given index.

### `col_major`

`static col_major(*dims: Int) -> Self`

Creates a column-major layout with the specified dimensions. In a column-major layout, the first dimension varies fastest in memory, which is the default layout in languages like Fortran and MATLAB.

Example:

```mojo
from layout import Layout

# Create a 3x4 column-major layout
var layout = Layout.col_major(3, 4)
# Result: Layout with shape (3,4) and stride (1,3)
```

**Args:**

* ​\*dims (`Int`): Variable number of dimension sizes.

**Returns:** A column-major Layout with the specified dimensions.

`static col_major(shape: IntTuple[origin]) -> Self`

Creates a column-major layout with the specified shape.
In a column-major layout, the first dimension varies fastest in memory, which is the default layout in languages like Fortran and MATLAB.

Example:

```mojo
from layout import Layout
from layout.int_tuple import IntTuple

# Create a 3x4 column-major layout
var layout = Layout.col_major(IntTuple(3, 4))
# Result: Layout with shape (3,4) and stride (1,3)
```

**Args:**

* ​shape (`IntTuple[origin]`): An IntTuple specifying the dimensions.

**Returns:** A column-major Layout with the specified shape.

### `row_major`

`static row_major(*dims: Int) -> Self`

Creates a row-major layout with the specified dimensions. In a row-major layout, the last dimension varies fastest in memory, which is the default layout in languages like C, C++, and Python.

Example:

```mojo
from layout import Layout

# Create a 3x4 row-major layout
var layout = Layout.row_major(3, 4)
# Result: Layout with shape (3,4) and stride (4,1)
```

**Args:**

* ​\*dims (`Int`): Variable number of dimension sizes.

**Returns:** A row-major Layout with the specified dimensions.

`static row_major[rank: Int](dims: DimList) -> Self`

Creates a row-major layout from a DimList with compile-time rank. This method creates a row-major layout where the last dimension varies fastest in memory. It handles both known and unknown dimensions at compile time, properly calculating strides for each dimension. If any dimension is unknown, subsequent strides will also be marked as unknown.

Example:

```mojo
from layout import Layout
from layout.layout import DimList

# Create a row-major layout with compile-time rank
var dims = DimList(3, 4)
var layout = Layout.row_major[2](dims)
# Result: Layout with shape (3,4) and stride (4,1)
```

**Parameters:**

* ​rank (`Int`): The compile-time rank (number of dimensions) of the layout.

**Args:**

* ​dims (`DimList`): A DimList containing the dimensions of the layout.

**Returns:** A row-major Layout with the specified dimensions and computed strides.

`static row_major(shape: IntTuple[origin]) -> Self`

Creates a row-major layout from an IntTuple of dimensions. In a row-major layout, the last dimension varies fastest in memory. This method computes the appropriate strides for a row-major layout given the input shape.

Example:

```mojo
from layout import Layout
from layout.int_tuple import IntTuple

# Create a row-major layout from a shape tuple
var shape = IntTuple(3, 4)
var layout = Layout.row_major(shape)
# Result: Layout with shape (3,4) and stride (4,1)
```

**Args:**

* ​shape (`IntTuple[origin]`): An IntTuple containing the dimensions of the layout.

**Returns:** A row-major Layout with the specified shape and computed strides.

### `make_shape_unknown`

`make_shape_unknown[axis: Int = -1](self) -> Self`

Creates a new Layout with unknown shape dimensions. This method creates a copy of the current Layout but marks either all dimensions or a specific dimension as unknown, while preserving the original strides. This is useful for tiling tensors with runtime sizes where the tile's shape is unknown but the memory layout (strides) remains constant.

Example:

```mojo
from layout import Layout
from layout.int_tuple import IntTuple

# Mark all dimensions as unknown
var layout = Layout(IntTuple(2, 3))
var unknown = layout.make_shape_unknown()
# Result: Layout with shape (?, ?) and original strides

# Mark only first dimension as unknown
var partial = layout.make_shape_unknown[0]()
# Result: Layout with shape (?, 3) and original strides
```

**Parameters:**

* ​axis (`Int`): The dimension to mark as unknown.
If UNKNOWN\_VALUE (the default), all dimensions are marked as unknown.

**Returns:** A new Layout with the specified dimension(s) marked as unknown and original strides preserved.

### `copy`

`copy(self) -> Self`

Explicitly constructs a copy of this layout. Creates a deep copy of the layout, including its shape and stride tuples.

**Returns:** A new Layout instance with identical shape and stride values.

### `__str__`

`__str__(self) -> String`

Converts the layout to a string representation.

**Returns:** A string representation of the layout in the format "(shape:stride)".

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes the layout to the specified writer. Formats the layout as "(shape:stride)" and writes it to the provided writer.

**Parameters:**

* ​W (`Writer`): Type parameter representing a Writer implementation.

**Args:**

* ​writer (`W`): The writer to output the layout representation to.

### `__len__`

`__len__(self) -> Int`

Returns the number of dimensions in the layout.

**Returns:** The number of elements in the shape tuple.

### `__iter__`

`__iter__(self) -> _LayoutIter[self]`

Returns an iterator over the layout's dimensions. Each iteration yields a Layout containing the shape and stride for one dimension.

**Returns:** An iterator over the layout's dimensions.

### `size`

`size(self) -> Int`

Returns the total number of elements in the layout's domain. Calculates the product of all dimensions in the shape.

**Returns:** The total number of elements in the layout.

### `cosize`

`cosize(self) -> Int`

Returns the size of the memory region spanned by the layout. Calculates the maximum memory index plus one, representing the total memory footprint required by the layout.

**Returns:** The size of the memory region required by the layout.

### `rank`

`rank(self) -> Int`

Returns the number of dimensions in the layout. This is equivalent to `__len__` and returns the number of elements in the shape tuple.

**Returns:** The number of dimensions in the layout.

### `__call__`

`__call__(self, idx: IntTuple[origin]) -> Int`

Maps logical coordinates to a linear memory index. This is the core functionality of a layout, converting multi-dimensional coordinates to a linear memory location.

**Args:**

* ​idx (`IntTuple[origin]`): The logical coordinates to map.

**Returns:** The linear memory index corresponding to the given coordinates.

### `append`

`append(mut self, item: Self)`

Appends another layout to this layout. This method adds the shape and stride from the provided layout to this layout, effectively increasing its dimensionality.

**Args:**

* ​item (`Self`): The layout to append to this layout.

### `all_dims_known`

`all_dims_known(self) -> Bool`

Checks if all dimensions in the layout have known values. A dimension is considered unknown if its shape or stride is set to the special `UNKNOWN_VALUE` constant.

**Returns:** True if all dimensions have known shape and stride values, False otherwise.

### `known_shape`

`known_shape(self) -> Bool`

Checks if all shape dimensions in the layout have known values. A dimension is considered unknown if its shape is set to the special `UNKNOWN_VALUE` constant. This method only checks shapes, not strides.

**Returns:** True if all shape dimensions have known values, False otherwise.

---

## layout_tensor

Provides the `LayoutTensor` type for representing multidimensional data.

## Aliases

### `binary_op_type`

`alias binary_op_type = fn[DType, Int](lhs: SIMD[$0, $1], rhs: SIMD[$0, $1]) -> SIMD[$0, $1]`

Type alias for binary operations on SIMD vectors.
This type represents a function that takes two SIMD vectors of the same type and width and returns a SIMD vector of the same type and width.

**Args:**

* ​type: The data type of the SIMD vector elements.
* ​width: The width of the SIMD vector.
* ​lhs: Left-hand side SIMD vector operand.
* ​rhs: Right-hand side SIMD vector operand.

**Returns:** A SIMD vector containing the result of the binary operation.

## Structs

* [​`LayoutTensor`](./LayoutTensor): A high-performance tensor with explicit memory layout and hardware-optimized access patterns.
* [​`LayoutTensorIter`](./LayoutTensorIter): Iterator for traversing a memory buffer with a specific layout.
* [​`ThreadScope`](./ThreadScope): Represents the scope of thread operations in GPU programming.

## Functions

* [​`copy`](./copy): Synchronously copy data from local memory (registers) to SRAM (shared memory).
* [​`copy_dram_to_local`](./copy_dram_to_local): Efficiently copy data from global memory (DRAM) to registers for AMD GPUs.
* [​`copy_dram_to_sram`](./copy_dram_to_sram): Synchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context.
* [​`copy_dram_to_sram_async`](./copy_dram_to_sram_async): Asynchronously copy data from DRAM (global memory) to SRAM (shared memory) in a GPU context.
* [​`copy_local_to_dram`](./copy_local_to_dram): Efficiently copy data from registers (LOCAL) to global memory (DRAM).
* [​`copy_local_to_local`](./copy_local_to_local): Synchronously copy data between local memory (register) tensors with type conversion.
* [​`copy_sram_to_dram`](./copy_sram_to_dram): Synchronously copy data from SRAM (shared memory) to DRAM (global memory).
* [​`copy_sram_to_local`](./copy_sram_to_local): Synchronously copy data from SRAM (shared memory) to local memory.
* [​`cp_async_k_major`](./cp_async_k_major): Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with K-major layout.
* [​`cp_async_mn_major`](./cp_async_mn_major): Asynchronously copy data from DRAM to SRAM using TMA (Tensor Memory Accelerator) with MN-major layout.
* [​`stack_allocation_like`](./stack_allocation_like): Create a stack-allocated tensor with the same layout as an existing tensor.

---

## LayoutTensor

`@register_passable(trivial)`
`struct LayoutTensor[mut: Bool, //, dtype: DType, layout: Layout, origin: Origin[mut], /, *, address_space: AddressSpace = AddressSpace(0), element_layout: Layout = __init__[::Origin[::Bool(IntTuple(1), IntTuple(1)), layout_int_type: DType = _get_layout_type(layout, address_space), linear_idx_type: DType = _get_index_type(layout, address_space), masked: Bool = False, alignment: Int = alignof[::DType,__mlir_type.!kgen.target]()]`

A high-performance tensor with explicit memory layout and hardware-optimized access patterns.

`LayoutTensor` provides a powerful abstraction for multi-dimensional data with precise control over memory organization. It supports various memory layouts (row-major, column-major, tiled), hardware-specific optimizations, and efficient parallel access patterns.

Example:

```mojo
from layout import Layout, LayoutTensor

var storage = InlineArray[Scalar[DType.float32], 5 * 4](uninitialized = True)
var tensor_5x4 = LayoutTensor[DType.float32, Layout.row_major(5, 4)](storage)
```

## Parameters

* ​mut (`Bool`): The inferred mutability of the underlying pointer.
* ​dtype (`DType`): The data type of the underlying pointer.
* ​layout (`Layout`): The memory layout of the tensor.
* ​origin (`Origin[mut]`): The origin of the underlying pointer.
* ​address\_space (`AddressSpace`): The address space of the underlying pointer.
* ​element\_layout (`Layout`): The memory layout of each element in the tensor.
* ​layout\_int\_type (`DType`): The integer type of each dimension of runtime layout.
* ​linear\_idx\_type (`DType`): The integer type of the index pointing to memory locations.
* ​masked (`Bool`): If true, the tensor is masked and runtime layouts determine the shape.
* ​alignment (`Int`): Alignment of the data pointer.

## Fields

* ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the underlying memory buffer containing the tensor data. This pointer respects the specified address space, alignment, mutability, and origin tracking for memory safety and performance optimization.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=layout_int_type, linear_idx_type=linear_idx_type]`): Runtime representation of the tensor's memory layout. Handles both compile-time and runtime-determined dimensions, enabling efficient mapping between logical tensor coordinates and physical memory locations.
* ​runtime\_element\_layout (`RuntimeLayout[element_layout, element_type=int32, linear_idx_type=linear_idx_type]`): Runtime representation of each element's internal layout. Used when elements themselves have structure, such as in blocked or tiled layouts.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `element_size`

`alias element_size = element_layout.size()`

The number of scalar values in each element of the tensor.

### `element_type`

`alias element_type = SIMD[dtype, element_layout.size()]`

The SIMD vector type used for vectorized operations on tensor elements.

### `rank`

`alias rank = layout.rank()`

The number of dimensions in the tensor's layout.

## Methods

### `__init__`

`@implicit`
`__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]) -> Self`

Create a `LayoutTensor` with a `Span`.

**Constraints:** Layout must be fully static.

**Args:**

* ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data.

`__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self`

Create a `LayoutTensor` with a `Span` and a runtime layout for the tensor. The runtime layout element type will be cast to the layout tensor layout integer type.

**Constraints:**

* Element layout must be fully static.

**Args:**

* ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the LayoutTensor.

`__init__(span: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self`

Create a `LayoutTensor` with a `Span`, a runtime layout of the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type.
**Constraints:**

* Runtime layout and `LayoutTensor` must have the same bitwidth and index type.

**Args:**

* ​span (`Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment]`): The `Span` pointing to the underlying data.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`.
* ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element.

`@implicit`
`__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self`

Create a `LayoutTensor` with an `UnsafePointer`.

**Constraints:** Layout must be fully static.

**Args:**

* ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The `UnsafePointer` pointing to the underlying data.

`__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self`

Create a `LayoutTensor` with an `UnsafePointer` and a runtime layout for the tensor. The runtime layout element type will be cast to the layout tensor layout integer type.

**Constraints:** Element layout must be fully static.

**Args:**

* ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The UnsafePointer pointing to the underlying data.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the LayoutTensor.

`__init__(ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> Self`

Create a `LayoutTensor` with an `UnsafePointer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type.

**Args:**

* ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The `UnsafePointer` pointing to the underlying data.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`.
* ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element.

`@implicit`
`__init__(ref [origin] device_buffer: DeviceBuffer[dtype]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Create a `LayoutTensor` from a `DeviceBuffer`. The layout must have statically known dimensions.

```mojo
from gpu.host import DeviceContext, DeviceBuffer
from layout import Layout, LayoutTensor

alias dtype = DType.float32
var ctx = DeviceContext()
# A 4x4 layout spans 16 elements, so allocate a buffer of 16.
var dev_buf = ctx.enqueue_create_buffer[dtype](16)

alias layout = Layout.row_major(4, 4)
var tensor = LayoutTensor[dtype, layout](dev_buf)
```

**Constraints:**

* Layout must be fully static.
**Args:**

* ​device\_buffer (`DeviceBuffer[dtype]`): Contains the underlying data to point to.

`@implicit`
`__init__(ref [origin] host_buffer: HostBuffer[dtype]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Create a `LayoutTensor` from a `HostBuffer`. The layout must have statically known dimensions.

```mojo
from gpu.host import DeviceContext, HostBuffer
from layout import Layout, LayoutTensor

alias dtype = DType.float32
var ctx = DeviceContext()
# A 4x4 layout spans 16 elements, so allocate a host buffer of 16.
var host_buf = ctx.enqueue_create_host_buffer[dtype](16)

alias layout = Layout.row_major(4, 4)
var tensor = LayoutTensor[dtype, layout](host_buf)
```

**Constraints:**

* Layout must be fully static.

**Args:**

* ​host\_buffer (`HostBuffer[dtype]`): Contains the underlying data to point to.

`__init__(ref [origin] device_buffer: DeviceBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Create a `LayoutTensor` from a `DeviceBuffer` and a runtime layout. The runtime layout element type will be cast to the layout tensor layout integer type.

**Constraints:**

* Element layout must be fully static.

**Args:**

* ​device\_buffer (`DeviceBuffer[dtype]`): The `DeviceBuffer` containing the underlying data.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the LayoutTensor.

`__init__(ref [origin] host_buffer: HostBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Create a `LayoutTensor` from a `HostBuffer` and a runtime layout. The runtime layout element type will be cast to the layout tensor layout integer type.

**Constraints:**

* Element layout must be fully static.

**Args:**

* ​host\_buffer (`HostBuffer[dtype]`): The `HostBuffer` containing the underlying data.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`.

`__init__(ref [origin] device_buffer: DeviceBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Create a `LayoutTensor` from a `DeviceBuffer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type.

**Args:**

* ​device\_buffer (`DeviceBuffer[dtype]`): The `DeviceBuffer` containing the underlying data.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`.
* ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element.
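The runtime-layout constructors above are useful when some dimensions are only known at runtime. The following is a rough sketch of the intended pattern, not verbatim API: it assumes `RuntimeLayout.row_major()` from the `runtime_layout` module and the `UNKNOWN_VALUE` constant from `layout.int_tuple`, so treat the exact imports and signatures as illustrative.

```mojo
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
from layout.int_tuple import UNKNOWN_VALUE  # assumed location of the constant
from layout.runtime_layout import RuntimeLayout  # assumed constructor pattern
from utils.index import IndexList

fn main() raises:
    alias dtype = DType.float32
    # The row count is only known at runtime; the column count is static.
    alias layout = Layout.row_major(UNKNOWN_VALUE, 4)

    var rows = 8
    var ctx = DeviceContext()
    var dev_buf = ctx.enqueue_create_buffer[dtype](rows * 4)

    # Materialize the concrete runtime sizes for the partially-static layout.
    var runtime_layout = RuntimeLayout[layout].row_major(IndexList[2](rows, 4))
    var tensor = LayoutTensor[dtype, layout](dev_buf, runtime_layout)
```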
`__init__(ref [origin] host_buffer: HostBuffer[dtype], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], element_runtime_layout: RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]) -> LayoutTensor[dtype, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Create a `LayoutTensor` from a `HostBuffer`, a runtime layout for the tensor, and the runtime layout of each element. The runtime layout element type will be cast to the layout tensor layout integer type.

**Args:**

* ​host\_buffer (`HostBuffer[dtype]`): The `HostBuffer` containing the underlying data.
* ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of the `LayoutTensor`.
* ​element\_runtime\_layout (`RuntimeLayout[element_layout, element_type=element_type, linear_idx_type=linear_idx_type]`): The runtime layout of each element.

### `__getitem__`

`__getitem__(self, *dims: Int) -> SIMD[dtype, element_layout.size()]`

Retrieves a single element from the tensor at the specified indices. This method provides array-like indexing for the tensor. The number of indices provided must match the rank of the tensor, otherwise an error will occur at runtime.

**Args:**

* ​\*dims (`Int`): The indices specifying the element's position in each dimension. For example, in a 3D tensor, you would use (i, j, k).

**Returns:** The element at the specified position with the tensor's data type.

`__getitem__(self, crd: RuntimeTuple[S, element_type=element_type]) -> SIMD[dtype, element_layout.size()]`

Retrieves a single element from the tensor at the specified indices. This method provides array-like indexing for the tensor. The number of indices provided must match the rank of the tensor, otherwise an error will occur at runtime.

**Args:**

* ​crd (`RuntimeTuple[S, element_type=element_type]`): The coordinate specifying the element's position in each dimension. For example, in a 3D tensor, you would use (i, j, k).

**Returns:** The element at the specified position with the tensor's data type.

### `__setitem__`

`__setitem__(self, d0: Int, val: SIMD[dtype, element_layout.size()])`

Sets a single element in a rank-1 tensor at the specified index. This method provides array-like element assignment for rank-1 tensors.

Notes:

* No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior.

**Args:**

* ​d0 (`Int`): The index along the first dimension.
* ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position.

`__setitem__(self, d0: Int, d1: Int, val: SIMD[dtype, element_layout.size()])`

Sets a single element in a rank-2 tensor at the specified indices. This method provides array-like element assignment for rank-2 tensors.

Performance:

* Direct memory access with minimal overhead.
* Memory access pattern follows the tensor's stride configuration.

Notes:

* No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior.

**Args:**

* ​d0 (`Int`): The index along the first dimension.
* ​d1 (`Int`): The index along the second dimension.
* ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position.

`__setitem__(self, d0: Int, d1: Int, d2: Int, val: SIMD[dtype, element_layout.size()])`

Sets a single element in a rank-3 tensor at the specified indices.
This method provides array-like element assignment for rank-3 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, d2: Int, d3: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-4 tensor at the specified indices. This method provides array-like element assignment for rank-4 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​d3 (`Int`): The index along the fourth dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. `__setitem__(self, d0: Int, d1: Int, d2: Int, d3: Int, d4: Int, val: SIMD[dtype, element_layout.size()])` Sets a single element in a rank-5 tensor at the specified indices. This method provides array-like element assignment for rank-5 tensors. Performance: * Direct memory access with minimal overhead. * Memory access pattern follows the tensor's stride configuration. Notes: * No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior. **Args:** * ​d0 (`Int`): The index along the first dimension. * ​d1 (`Int`): The index along the second dimension. * ​d2 (`Int`): The index along the third dimension. * ​d3 (`Int`): The index along the fourth dimension. * ​d4 (`Int`): The index along the fifth dimension. * ​val (`SIMD[dtype, element_layout.size()]`): The value to write to the tensor at the specified position. ### `__add__` `__add__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Add a scalar value to each element of the tensor. Performs an elementwise addition operation, adding the scalar value to each element in the tensor. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the addition. * For in-place addition, use the `__iadd__` method instead (`+=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to add to each element. **Returns:** A new tensor containing the results of the addition operation. `__add__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Add another tensor to this tensor elementwise. 
Performs an elementwise addition between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the addition. * For in-place addition, use the `__iadd__` method instead (`+=` operator). **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to add to this tensor. **Returns:** A new tensor containing the results of the addition operation. ### `__sub__` `__sub__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Subtract a scalar value from each element of the tensor. Performs an elementwise subtraction operation, subtracting the scalar value from each element in the tensor. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the subtraction. * For in-place subtraction, use the `__isub__` method instead (`-=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to subtract from each element. **Returns:** A new tensor containing the results of the subtraction operation. `__sub__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Subtract another tensor from this tensor elementwise. Performs an elementwise subtraction between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the subtraction. * For in-place subtraction, use the `__isub__` method instead (`-=` operator). **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to subtract from this tensor. **Returns:** A new tensor containing the results of the subtraction operation. ### `__mul__` `__mul__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Multiply each element of the tensor by a scalar value. 
Performs an elementwise multiplication operation, multiplying each element in the tensor by the scalar value. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the multiplication. * For in-place multiplication, use the `__imul__` method instead (`*=` operator). **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to multiply with each element. **Returns:** A new tensor containing the results of the multiplication operation. `__mul__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Multiply this tensor with another tensor elementwise. Performs an elementwise multiplication (Hadamard product) between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Note: This is NOT a matrix multiplication operation. For matrix multiplication, use the appropriate matmul function instead. Performance: * This operation creates a copy of the tensor before performing the multiplication. * For in-place multiplication, use the `__imul__` method instead (`*=` operator). **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to multiply with this tensor. **Returns:** A new tensor containing the results of the elementwise multiplication. ### `__truediv__` `__truediv__(self, other: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Divide each element of the tensor by a scalar value. Performs an elementwise division operation, dividing each element in the tensor by the scalar value. This operation creates a new tensor with the results. Performance: * This operation creates a copy of the tensor before performing the division. * For in-place division, use the `__itruediv__` method instead (`/=` operator). Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to divide each element by. **Returns:** A new tensor containing the results of the division operation. 
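As a quick illustration of the elementwise operators and indexing methods above, here is a minimal sketch using a small stack-backed tensor (mirroring the `InlineArray` construction shown earlier; the printed values assume a default scalar element layout):

```mojo
from layout import Layout, LayoutTensor

fn main():
    var storage = InlineArray[Scalar[DType.float32], 2 * 2](uninitialized = True)
    var t = LayoutTensor[DType.float32, Layout.row_major(2, 2)](storage)

    # Fill the tensor via __setitem__: t[i, j] = value.
    for i in range(2):
        for j in range(2):
            t[i, j] = Float32(i * 2 + j)

    # Non-mutating operators return new tensors.
    var shifted = t + 1.0   # __add__
    var halved = t / 2.0    # __truediv__

    # Compound assignment mutates in place.
    t *= 3.0                # __imul__

    print(shifted[1, 1])  # 4.0 (original 3.0 + 1.0)
    print(halved[1, 1])   # 1.5 (original 3.0 / 2.0)
    print(t[1, 1])        # 9.0 (original 3.0 * 3.0)
```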
`__truediv__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Divide this tensor by another tensor elementwise. Performs an elementwise division between this tensor and another tensor. This operation creates a new tensor with the results. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation creates a copy of the tensor before performing the division. * For in-place division, use the `__itruediv__` method instead (`/=` operator). Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to divide this tensor by. **Returns:** A new tensor containing the results of the division operation. ### `__iadd__` `__iadd__(self, other: SIMD[dtype, 1])` Add a scalar value to each element of the tensor in-place. Performs an elementwise addition operation, adding the scalar value to each element in the tensor. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to add to each element. `__iadd__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Add another tensor to this tensor elementwise in-place. Performs an elementwise addition between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to add to this tensor. ### `__isub__` `__isub__(self, other: SIMD[dtype, 1])` Subtract a scalar value from each element of the tensor in-place. Performs an elementwise subtraction operation, subtracting the scalar value from each element in the tensor. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to subtract from each element. 
`__isub__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Subtract another tensor from this tensor elementwise in-place. Performs an elementwise subtraction between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to subtract from this tensor. ### `__imul__` `__imul__(self, other: SIMD[dtype, 1])` Multiply each element of the tensor by a scalar value in-place. Performs an elementwise multiplication operation, multiplying each element in the tensor by the scalar value. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to multiply with each element. `__imul__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Multiply this tensor with another tensor elementwise in-place. Performs an elementwise multiplication (Hadamard product) between this tensor and another tensor. This operation modifies the tensor in-place. Limited broadcasting is supported: * For tensors of the same rank, shapes must match exactly. * For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor. Note: This is NOT a matrix multiplication operation. For matrix multiplication, use the appropriate matmul function instead. Performance: * This operation modifies the tensor directly without creating a copy. **Parameters:** * ​other\_layout (`Layout`): The layout of the other tensor. **Args:** * ​other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to multiply with this tensor. ### `__itruediv__` `__itruediv__(self, other: SIMD[dtype, 1])` Divide each element of the tensor by a scalar value in-place. Performs an elementwise division operation, dividing each element in the tensor by the scalar value. This operation modifies the tensor in-place. Performance: * This operation modifies the tensor directly without creating a copy. Notes: * Division by zero will result in undefined behavior or errors depending on the dtype. * For integer dtypes, this performs integer division. **Args:** * ​other (`SIMD[dtype, 1]`): The scalar value to divide each element by. 
`__itruediv__[other_layout: Layout](self, other: LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Divide this tensor by another tensor elementwise in-place.

Performs an elementwise division between this tensor and another tensor. This operation modifies the tensor in-place.

Limited broadcasting is supported:

* For tensors of the same rank, shapes must match exactly.
* For rank-1 to rank-2 broadcasting, the rank-1 tensor's dimension must match the corresponding dimension of the rank-2 tensor.

Performance:

* This operation modifies the tensor directly without creating a copy.

Notes:

* Division by zero will result in undefined behavior or errors depending on the dtype.
* For integer dtypes, this performs integer division.

**Parameters:**

* other\_layout (`Layout`): The layout of the other tensor.

**Args:**

* other (`LayoutTensor[dtype, other_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The tensor to divide this tensor by.

### `copy`

`copy(self) -> Self`

Explicitly copy this `LayoutTensor`.

**Returns:**

A copy of the value.

### `bitcast`

`bitcast[new_type: DType, /, address_space: AddressSpace = address_space, element_layout: Layout = element_layout](self) -> LayoutTensor[new_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`

Bitcast the underlying pointer to a new data type.

**Parameters:**

* new\_type (`DType`): The new data type to cast to.
* address\_space (`AddressSpace`): The address space of the returned `LayoutTensor`.
* element\_layout (`Layout`): The element layout of the returned `LayoutTensor`.

**Returns:**

A new `LayoutTensor` with the same memory location but with the specified data type, address space, and element layout.

### `origin_cast`

`origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Changes the origin or mutability of a pointer.

**Parameters:**

* mut (`Bool`): Whether the origin is mutable.
* origin (`Origin[mut]`): Origin of the destination pointer.

**Returns:**

A new `LayoutTensor` object with the same type and the same address as the original `LayoutTensor`, and the new specified mutability and origin.

### `address_space_cast`

`address_space_cast[address_space: AddressSpace = address_space](self) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Changes the address space of the underlying pointer.

**Parameters:**

* address\_space (`AddressSpace`): The new address space.

**Returns:**

A new `LayoutTensor` object with the same type and origin as the original `LayoutTensor`, and the new specified address\_space.
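Taken together, the arithmetic methods above give `LayoutTensor` elementwise operator semantics. The following is a minimal sketch of the in-place and out-of-place forms, not part of the API reference: it assumes small stack-allocated CPU tensors created with `Layout.row_major` and `stack_allocation()` (documented below), with illustrative 2×2 shapes.

```mojo
from layout import Layout, LayoutTensor

fn main():
    # Two 2x2 float32 tensors on the stack (illustrative shapes).
    var a = LayoutTensor[
        DType.float32, Layout.row_major(2, 2), MutableAnyOrigin
    ].stack_allocation().fill(2.0)
    var b = LayoutTensor[
        DType.float32, Layout.row_major(2, 2), MutableAnyOrigin
    ].stack_allocation().fill(4.0)

    a += 1.0       # __iadd__: add a scalar to every element, in place
    a *= b         # __imul__: elementwise (Hadamard) product, in place
    var c = b / a  # __truediv__: elementwise division into a new tensor
    print(c)
```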
### `get_immutable`

`get_immutable(self) -> LayoutTensor[dtype, layout, (muttoimm origin._mlir_origin), address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Return an immutable version of this tensor.

**Returns:**

A `LayoutTensor` covering the same elements, but without mutability.

### `__exp__`

`__exp__(self) -> Self`

Computes the element-wise exponential function.

Returns a new tensor containing the [element-wise exponential](/mojo/stdlib/math/math/exp/) of the input tensor.

**Returns:**

A new tensor containing the element-wise exponential.

### `load`

`load[width: Int](self, m: Int, n: Int) -> SIMD[dtype, width]`

Load a SIMD vector from the tensor at the specified 2D coordinates.

Performs a vectorized load operation from the tensor's memory, retrieving `width` consecutive elements starting at position (m, n). This method enables efficient SIMD operations on tensor data.

Performance:

* Uses unaligned memory access which may be slower on some architectures.
* For aligned access, use `aligned_load` instead when data alignment is guaranteed.
* The load operation is optimized based on the tensor's memory layout.

Notes:

* No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior.
* The elements are loaded according to the tensor's stride configuration.

**Parameters:**

* width (`Int`): The number of elements to load into the SIMD vector. Should match the target hardware's vector width for optimal performance.

**Args:**

* m (`Int`): The row index (first dimension).
* n (`Int`): The column index (second dimension).

**Returns:**

A SIMD vector containing `width` consecutive elements from the tensor.

### `prefetch`

`prefetch(self, m: Int, n: Int)`

Prefetch tensor data at the specified 2D coordinates into cache.

Issues a software prefetch hint to the processor to load the data at position (m, n) into the cache hierarchy. This can improve performance by reducing memory latency for subsequent accesses to the same location.

Performance:

* Prefetching is a performance hint and does not guarantee data will be cached.
* Most effective when issued sufficiently ahead of the actual data access.
* Uses high locality prefetch to the data cache, optimized for data that will be accessed multiple times.
* Can reduce memory access latency by 50-90% when used correctly.

Notes:

* Excessive prefetching can pollute the cache and degrade performance.
* Most beneficial for predictable access patterns that would otherwise cause cache misses.
* No operation is performed on the prefetched data.

**Args:**

* m (`Int`): The row index (first dimension).
* n (`Int`): The column index (second dimension).

### `aligned_load`

`aligned_load[width: Int](self, m: Int, n: Int) -> SIMD[dtype, width]`

Load a SIMD vector with alignment guarantees from the tensor.

Performs an aligned vectorized load operation from the tensor's memory, retrieving `width` consecutive elements starting at position (m, n). The alignment is automatically calculated based on the SIMD width and dtype.

Performance:

* Uses aligned memory access which is faster than unaligned access on most architectures.
* The alignment is automatically calculated based on the SIMD width and dtype.
* Can be up to 2x faster than unaligned loads on architectures that require alignment.

Notes:

* The caller must ensure that the memory at (m, n) is properly aligned.
Misaligned access with this method may cause hardware exceptions on some architectures.

* No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior.

**Parameters:**

* width (`Int`): The number of elements to load into the SIMD vector. Should match the target hardware's vector width for optimal performance.

**Args:**

* m (`Int`): The row index (first dimension).
* n (`Int`): The column index (second dimension).

**Returns:**

A SIMD vector containing `width` consecutive elements from the tensor.

### `store`

`store[width: Int](self, m: Int, n: Int, val: SIMD[dtype, width])`

Store a SIMD vector to the tensor at the specified 2D coordinates.

Performs a vectorized store operation to the tensor's memory, writing `width` consecutive elements starting at position (m, n). This method enables efficient SIMD operations on tensor data.

Performance:

* Uses unaligned memory access which may be slower on some architectures.
* For aligned access, use `aligned_store` instead when data alignment is guaranteed.
* The store operation is optimized based on the tensor's memory layout.

Notes:

* No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior.
* The elements are stored according to the tensor's stride configuration.
* This operation modifies the tensor's data in-place.

**Parameters:**

* width (`Int`): The number of elements in the SIMD vector to store. Should match the target hardware's vector width for optimal performance.

**Args:**

* m (`Int`): The row index (first dimension) where the store operation begins.
* n (`Int`): The column index (second dimension) where the store operation begins.
* val (`SIMD[dtype, width]`): The SIMD vector containing the values to store in the tensor.

### `aligned_store`

`aligned_store[width: Int](self, m: Int, n: Int, val: SIMD[dtype, width])`

Store a SIMD vector with alignment guarantees to the tensor.

Performs an aligned vectorized store operation to the tensor's memory, writing `width` consecutive elements starting at position (m, n). The alignment is automatically calculated based on the SIMD width and dtype.

Performance:

* Uses aligned memory access which is faster than unaligned access on most architectures.
* The alignment is automatically calculated based on the SIMD width and dtype.
* Can be up to 2x faster than unaligned stores on architectures that require alignment.
* Particularly important for streaming stores that bypass the cache.

Notes:

* The caller must ensure that the memory at (m, n) is properly aligned. Misaligned access with this method may cause hardware exceptions on some architectures.
* No bounds checking is performed. Accessing out-of-bounds indices will result in undefined behavior.
* This operation modifies the tensor's data in-place.

**Parameters:**

* width (`Int`): The number of elements in the SIMD vector to store. Should match the target hardware's vector width for optimal performance.

**Args:**

* m (`Int`): The row index (first dimension) where the store operation begins.
* n (`Int`): The column index (second dimension) where the store operation begins.
* val (`SIMD[dtype, width]`): The SIMD vector containing the values to store in the tensor.

### `size`

`size(self) -> Int`

Get the total number of elements that the tensor can contain.

**Returns:**

The total number of elements that can be stored in the tensor.
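As a concrete illustration of the vectorized access path, here is a minimal sketch (assuming a stack-allocated 4×8 row-major tensor; the shape is illustrative) that loads a 4-wide SIMD vector, scales it, and stores it back:

```mojo
from layout import Layout, LayoutTensor

fn main():
    var t = LayoutTensor[
        DType.float32, Layout.row_major(4, 8), MutableAnyOrigin
    ].stack_allocation().fill(1.0)

    # Load 4 consecutive elements from row 0, starting at column 0.
    var v = t.load[4](0, 0)
    # Store the scaled vector back to the same location.
    t.store[4](0, 0, v * 2.0)
    # aligned_load/aligned_store have the same shape, with alignment
    # guarantees that the caller is responsible for upholding.
```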
### `stack_allocation` `static stack_allocation[*, alignment: Int = alignment]() -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Allocates stack memory for a `LayoutTensor` with a fully static layout. Creates a new `LayoutTensor` instance with memory allocated on the stack rather than the heap. This provides deterministic memory management and potentially better performance for tensors with known sizes at compile time. Performance: * Stack allocation is typically faster than heap allocation. * Proper alignment can significantly improve memory access performance, especially for vectorized operations. * No dynamic memory management overhead (no malloc/free calls). Notes: * Only works with tensors that have fully static layouts known at compile time. * Stack memory is limited, so this should only be used for reasonably sized tensors. * The allocated memory is automatically freed when the function returns. **Constraints:** * The layout must be fully static (all dimensions known at compile time). * The alignment must be a multiple of the tensor's minimum required alignment. **Parameters:** * ​alignment (`Int`): Memory alignment value for the allocation in bytes. Must be a multiple of the tensor's minimum required alignment. Default is the tensor's natural alignment based on its data type and layout. **Returns:** A new `LayoutTensor` instance with memory allocated on the stack. ### `shape` `static shape[idx: Int]() -> Int` Returns the size of the tensor along the specified dimension. Provides static access to the tensor's shape information. This method returns the size of a specific dimension without requiring an instance of the tensor, as the shape is part of the tensor's static type information. Performance: * This is a compile-time operation with no runtime cost when used with static dimensions. Notes: * This is a static method that operates on the tensor's type information, not on a specific tensor instance. **Parameters:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 3D tensor with shape \[10, 20, 30]: * `shape[0]()` returns 10 (first dimension). * `shape[1]()` returns 20 (second dimension). * `shape[2]()` returns 30 (third dimension). **Returns:** The size of the tensor along the specified dimension as an integer. ### `stride` `static stride[idx: Int]() -> Int` Returns the memory stride of the tensor along the specified dimension. Provides static access to the tensor's stride information. The stride represents the number of elements to skip in memory to move one position along a particular dimension. This method returns the stride without requiring an instance of the tensor, as the stride is part of the tensor's static type information. Performance: * This is a compile-time operation with no runtime cost when used with static dimensions. * Understanding stride patterns is crucial for optimizing memory access patterns in performance-critical code. Notes: * Strides depend on the memory layout (row-major, column-major, or custom). * For non-contiguous tensors (e.g., tensor slices), strides may not follow a simple pattern. **Parameters:** * ​idx (`Int`): The dimension index to query (0-based). For example, in a 2D tensor with shape \[10, 20] and row-major layout: * `stride[0]()` might return 20 (moving one row requires skipping 20 elements). 
* `stride[1]()` might return 1 (moving one column requires skipping 1 element).

**Returns:**

The memory stride of the tensor along the specified dimension as an integer.

### `dim`

`dim[idx: Int](self) -> Int`

Returns the runtime dimension size of the tensor along the specified axis.

Unlike the static `shape` method, this instance method provides access to the tensor's actual dimension sizes at runtime, which is necessary for tensors with dynamic shapes or when working with tensor slices.

Performance:

* This is a runtime operation that accesses the tensor's runtime layout information.
* For static dimensions known at compile time, prefer the static `shape` method when possible for better performance.

Notes:

* This method works with both static and dynamic dimensions.
* For tensors with masked or partial views, this returns the actual size of the view, not the original tensor.

**Constraints:**

* Only works with tensors that have depth-1 layouts (no nested shapes).

**Parameters:**

* idx (`Int`): The dimension index to query (0-based). For example, in a 3D tensor with shape `[10, 20, 30]`:
  * `dim[0]()` returns 10 (first dimension).
  * `dim[1]()` returns 20 (second dimension).
  * `dim[2]()` returns 30 (third dimension).

**Returns:**

The size of the tensor along the specified dimension as an integer.

### `coalesce`

`coalesce(self) -> LayoutTensor[dtype, coalesce(layout, False), origin, address_space=address_space, element_layout=element_layout]`

Creates a tensor with a coalesced memory layout from this tensor.

Coalescing a tensor's layout means reorganizing its memory representation to be as contiguous as possible, which can improve memory access patterns and performance. This operation does not move or copy data; it only changes how the same memory is interpreted.

Performance:

* Coalesced layouts typically provide better cache utilization and memory access patterns.
* This operation is zero-cost at runtime as it only changes the layout information, not the actual data.
* Particularly beneficial before operations that perform sequential memory access or vectorized operations.

Notes:

* The coalesced tensor shares the same memory as the original tensor, so modifications to one will affect the other.
* The shape of the tensor remains the same, only the stride information is optimized.
* For already optimally coalesced tensors, this operation has no effect.

**Returns:**

A tensor with the same data but with a coalesced memory layout. The returned tensor has type `LayoutTensor` with the same dtype but with a coalesced layout.

### `tile_type`

`static tile_type[*tile_sizes: Int](*tile_coords: Int) -> LayoutTensor[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int]()]`

Returns the type of a tile view of the tensor with specified dimensions and coordinates.

**Parameters:**

* \*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor.

**Args:**

* \*tile\_coords (`Int`): The coordinates of the specific tile to extract.

**Returns:**

The type of a view into the original tensor representing the specified tile.
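Before moving on to tiling, here is a minimal sketch contrasting the compile-time `shape()`/`stride()` queries with the runtime `dim()` query. It assumes a static 10×20 row-major tensor allocated with `stack_allocation()`; the shape is illustrative.

```mojo
from layout import Layout, LayoutTensor

fn main():
    alias T = LayoutTensor[
        DType.float32, Layout.row_major(10, 20), MutableAnyOrigin
    ]
    var t = T.stack_allocation()

    # Static queries are answered from the type; no instance is required.
    print(T.shape[0]())   # 10
    print(T.stride[0]())  # 20: rows are 20 elements apart in row-major order
    # The runtime query reads the instance's runtime layout.
    print(t.dim[0]())     # 10
```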
### `tile` `tile[*tile_sizes: Int](self, *tile_coords: Int) -> LayoutTensor[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int]()]` Extract a tile (sub-tensor) from this tensor with specified dimensions and position. Tiling is a fundamental operation for high-performance tensor computations that divides a tensor into smaller blocks for better cache locality and parallelism. This method extracts a specific tile at the given coordinates without copying data. Example: For a 4×4 tensor with values: ``` [1 2 3 4] [2 3 4 5] [5 4 3 2] [1 1 1 1] ``` `tile[2, 2](1, 0)` will extract the tile: ``` [5 4] [1 1] ``` Performance: * Creates a view without copying data, making it very efficient. * Optimized for both static and dynamic layouts with different code paths. * Properly handles edge cases where tiles may be partially outside the tensor. * Maintains stride information for efficient memory access within the tile. Notes: * The resulting tile is a view into the original tensor, so modifications to the tile will affect the original tensor. * For tiles at the edges of the tensor, the actual dimensions may be smaller than the requested tile\_sizes if masking is enabled. * The implementation automatically selects between static and dynamic tiling based on the tensor's layout properties. **Parameters:** * ​\*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. For example, in a 2D tensor, `tile[32, 32]` creates 32×32 tiles. **Args:** * ​\*tile\_coords (`Int`): The coordinates of the specific tile to extract. For example, `tile[32, 32](1, 2)` extracts the tile at position (1, 2) in the grid of 32×32 tiles. **Returns:** A view into the original tensor representing the specified tile. ### `tiled_iterator` `tiled_iterator[*tile_sizes: Int, *, axis: Int = 0](self, *tile_coords: Int) -> LayoutTensorIter[dtype, _compute_tile_layout[*::Int]()[0], origin, address_space=address_space, axis=OptionalReg[Int]({:_stdlib::_builtin::_int::_Int axis, 0}), layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _tile_is_masked[::Layout,*::Int]()]` Create an iterator that traverses tiles along a specified axis. This method creates an iterator that allows efficient traversal of tiles within a tensor. The iterator starts at the specified tile coordinates and can move along the specified axis, providing access to consecutive tiles. Performance: * Provides efficient sequential access to tiles with good cache locality. * Optimized for both static and dynamic layouts with different code paths. * Maintains stride information for efficient memory access within each tile. * Properly handles edge cases where tiles may be partially outside the tensor. Notes: * The iterator provides views into the original tensor, so modifications through the iterator will affect the original tensor. * For tiles at the edges of the tensor, the actual dimensions may be smaller than the requested tile\_sizes if masking is enabled. * The iterator is not circular by default, meaning it will not wrap around when reaching the end of the tensor along the iteration axis. * The implementation automatically selects between static and dynamic tiling based on the tensor's layout properties. 
Example:

```mojo
var iter = tensor.tiled_iterator[16, 16, axis=0](0, 0)
for _ in range(num_tiles_along_axis):
    var tile = iter.get()
    # Process tile
    iter.next()
```

**Parameters:**

* \*tile\_sizes (`Int`): The dimensions of each tile along each axis of the tensor. For example, in a 2D tensor, `tiled_iterator[32, 32]` creates an iterator over 32×32 tiles.
* axis (`Int`): The axis along which the iterator will traverse. Default is 0 (first dimension). For example, with axis=0, the iterator will move vertically through tiles.

**Args:**

* \*tile\_coords (`Int`): The starting coordinates of the tile where iteration begins.

**Returns:**

A `LayoutTensorIter` that can be used to traverse tiles along the specified axis.

### `split`

`split[count: Int, axis: Int = 0](self) -> StaticTuple[LayoutTensor[dtype, _compute_tile_layout[::Int,::Int]()[0], origin, address_space=address_space, element_layout=element_layout, alignment=alignment], count]`

Split the `LayoutTensor` along an axis and return a `StaticTuple` of `LayoutTensor`.

**Parameters:**

* count (`Int`): The number of partitions to split the tensor into.
* axis (`Int`): The axis along which the split is applied.

**Returns:**

A `StaticTuple` containing `count` `LayoutTensor`s, each representing an equal-sized partition of the original tensor along the specified axis. Each partition has the same data type and memory characteristics as the original tensor, but with a reduced size along the split axis.

`split[axis: Int = 0, alignment: Int = 1](self, count: Int, idx: Int) -> LayoutTensor[dtype, layout.make_shape_unknown[::Int](), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]`

Retrieve a specific partition of the tensor after splitting along a specified axis.

This method divides the tensor into `count` partitions along the specified axis and returns the partition at index `idx`. The partitioning is done with alignment considerations to optimize memory access patterns.

Unlike the overloaded split method that returns all partitions, this method returns only a single partition, making it more memory-efficient for cases where only one partition is needed at a time.

Notes:

* The shape along the split axis becomes unknown at compile time.
* Only works with dimensions that have statically known sizes.
* The last partition may be smaller than others if the dimension size is not evenly divisible by `count`.
* Partition sizes are aligned up to the specified alignment value, which can improve performance for vectorized operations.

Performance:

* Uses aligned partitioning to improve memory access patterns.
* Avoids creating all partitions in memory, reducing memory usage.
* Maintains the original tensor's stride information for efficient element access within the partition.

**Constraints:**

* The dimension being split must have a statically known size.
* Cannot split dimensions with unknown or dynamic sizes.

**Parameters:**

* axis (`Int`): The axis along which to split the tensor. Defaults to 0 (first dimension).
* alignment (`Int`): Memory alignment value for the partition size. Defaults to 1.

**Args:**

* count (`Int`): The number of partitions to divide the tensor into.
* idx (`Int`): The index of the partition to return (0-based).

**Returns:**

A `LayoutTensor` representing the requested partition.
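A minimal sketch of both `split` overloads, assuming an 8×4 stack-allocated tensor (the shape and variable names are illustrative):

```mojo
from layout import Layout, LayoutTensor

fn main():
    var t = LayoutTensor[
        DType.float32, Layout.row_major(8, 4), MutableAnyOrigin
    ].stack_allocation().fill(1.0)

    # First overload: all partitions at once, as a StaticTuple.
    var halves = t.split[2]()   # two 4x4 views along axis 0
    print(halves[0][0, 0])
    # Second overload: materialize only one partition (axis defaults to 0).
    var bottom = t.split(2, 1)  # partition index 1 of 2
    print(bottom.dim[0]())      # 4
```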
### `distribute` `distribute[threads_layout: Layout, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), submode_axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1})](self, thread_id: UInt) -> LayoutTensor[dtype, _compute_distribute_layout[::Layout,::Layout,::OptionalReg[::Int]]()[1], origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked if masked else _distribute_is_masked[::Layout,::Layout,::OptionalReg[::Int]]() if is_nvidia_gpu() else False]` Distribute tensor workload across multiple threads in a structured pattern. This method partitions a tensor across multiple threads for parallel processing, assigning each thread a specific portion of the tensor. The distribution pattern is determined by the threads\_layout parameter, which defines the logical arrangement of threads. Example: For a 4×4 tensor distributed across 4 threads in a 2×2 grid: * Thread 0 might get the top-left quadrant * Thread 1 might get the top-right quadrant * Thread 2 might get the bottom-left quadrant * Thread 3 might get the bottom-right quadrant If axis=0 is specified with the same setup: * Thread 0 and Thread 2 would get the same data (left half) * Thread 1 and Thread 3 would get the same data (right half) Performance: * Creates a view without copying data, making it very efficient for parallel processing. * The swizzle parameter can significantly improve cache locality and memory access patterns. * Optimized for both static and dynamic layouts with different code paths. Notes: * The resulting tensor is a view into the original tensor, so modifications will affect the original tensor. * For optimal performance, the `threads_layout` should match the hardware's thread organization (e.g., warp/wavefront size and shape). * When using swizzling, carefully consider the memory access patterns to avoid cache thrashing or bank conflicts. * This function is particularly useful for GPU programming where threads are organized in structured grids. **Constraints:** * For dynamic layouts, the shape must be known at runtime and the threads\_layout must be fully static. **Parameters:** * ​threads\_layout (`Layout`): Defines the logical arrangement of threads (e.g., 2×2 grid of 4 threads). This layout determines how the tensor is partitioned. * ​axis (`OptionalReg[Int]`): Optional. If specified, restricts distribution to only this axis. For example, with axis=0 in a 2D thread layout, threads that differ only in their second coordinate will receive the same data. * ​swizzle (`OptionalReg[Swizzle]`): Optional. A function that remaps the distribution pattern to improve memory access patterns or cache locality. * ​submode\_axis (`OptionalReg[Int]`): Optional. Specifies an axis for specialized distribution modes. **Args:** * ​thread\_id (`UInt`): The ID of the current thread (0-based). **Returns:** A view into the original tensor representing the portion assigned to this thread. ### `vectorize_type` `static vectorize_type[*vector_shape: Int]() -> LayoutTensor[dtype, coalesce(_compute_tile_layout[*::Int]()[1], True), origin, address_space=address_space, element_layout=_divide_tiles[*::Int]()[0], layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]` Returns the type of a vectorized view of the tensor with specified vector dimensions. **Parameters:** * ​\*vector\_shape (`Int`): The dimensions of each vector unit along each axis of the tensor. 
**Returns:**

The type of a view into the original tensor with a vectorized layout.

### `vectorize`

`vectorize[*vector_shape: Int](self) -> LayoutTensor[dtype, coalesce(_compute_tile_layout[*::Int]()[1], True), origin, address_space=address_space, element_layout=_divide_tiles[*::Int]()[0], layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`

Reshape a tensor into a vectorized form for efficient SIMD operations.

This method transforms the tensor's logical layout to enable efficient vectorized processing, treating blocks of elements as vector units. The transformation is particularly useful for SIMD (Single Instruction Multiple Data) operations and hardware acceleration.

Example: For a 16×16 tensor, `vectorize[4, 4]` will produce a 4×4 tensor where each element represents a 4×4 block from the original tensor.

Performance:

* Creates a view without copying data, making it very efficient.
* Enables hardware-accelerated vector operations on blocks of data.
* Improves cache locality by grouping related elements together.
* Particularly beneficial for operations that can leverage SIMD instructions.

Notes:

* The tensor dimensions must be divisible by the corresponding vector dimensions.
* For dimensions with unknown size, the corresponding vector dimension must be 1.
* The resulting tensor has the same data but a different logical organization.
* Modifications to the vectorized tensor affect the original tensor.
* This transformation is particularly useful for GPU and vector processor optimizations.

**Constraints:**

* Each tensor dimension must be divisible by the corresponding vector dimension.
* Vector dimensions must be smaller than or equal to the corresponding tensor dimensions.
* For dimensions with unknown size, the vector dimension must be 1.

**Parameters:**

* \*vector\_shape (`Int`): The dimensions of each vector unit along each axis of the tensor. For example, in a 2D tensor, `vectorize[4, 4]` treats 4×4 blocks as vector units.

**Returns:**

A view of the tensor with a vectorized layout, where each element in the resulting tensor represents a vector of elements from the original tensor.
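A minimal sketch of the transformation, assuming a static 16×16 stack-allocated tensor (the shape is illustrative):

```mojo
from layout import Layout, LayoutTensor

fn main():
    var t = LayoutTensor[
        DType.float32, Layout.row_major(16, 16), MutableAnyOrigin
    ].stack_allocation().fill(1.0)

    # View the 16x16 tensor as a 4x4 grid of 4x4 vector units.
    var blocks = t.vectorize[4, 4]()
    # blocks shares memory with t: each logical element of blocks is one
    # 4x4 block of t, so vectorized operations act on whole blocks.
    _ = blocks
```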
### `slice`

`slice[d0_slice: Slice, d1_slice: Slice](self) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, d1_slice), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]`

Extract a slice from a rank-2 tensor using slice objects.

This method creates a view into a subset of the tensor defined by the slice specifications for each dimension. The slice is a continuous region of the tensor with no gaps (step size must be 1).

Example: For a 4×4 tensor with values:

```
[1 2 3 4]
[5 6 7 8]
[9 10 11 12]
[13 14 15 16]
```

```mojo
slice[Slice(1, 3), Slice(0, 2)]
```

will extract:

```
[5 6]
[9 10]
```

Performance:

* Creates a view without copying data, making it very efficient.
* Maintains the original tensor's stride information for efficient memory access.
* Zero-cost abstraction at runtime when used with compile-time constant slices.

Notes:

* The slice is a view into the original tensor, so modifications to the slice will affect the original tensor.
* Only supports rank-2 tensors. For higher-rank tensors, use the overloaded version with slice indices.
* The step size must be 1 (no gaps allowed in the slice).
* Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior.

**Constraints:**

* Only works with rank-2 tensors.

**Parameters:**

* d0\_slice (`Slice`): Slice specification for the first dimension (rows). Defines the start and end indices for the slice along this dimension.
* d1\_slice (`Slice`): Slice specification for the second dimension (columns). Defines the start and end indices for the slice along this dimension.

**Returns:**

A view into the original tensor representing the specified slice.

`slice[d0_slice: Slice, d1_slice: Slice, slice_indices: IndexList[2], __offset_dims: Int = (layout.rank() + -2)](self, offsets: IndexList[__offset_dims]) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, d1_slice, slice_indices.__getitem__[::Indexer](0), slice_indices.__getitem__[::Indexer](1)), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]`

Extract a 2D slice from a higher-rank tensor at specific indices. This method creates a view into a 2D subset of a higher-rank tensor by:

1. Selecting two dimensions to slice using the `slice_indices` parameter.
2. Applying slice specifications to those dimensions.
3. Using fixed offsets for all other dimensions.

Example: Given a 3×4×5 tensor, the following example extracts a 2×2 slice from dimensions 0 and 2, with dimension 1 fixed at index 1.

```mojo
var s = t.slice[Slice(1, 3), Slice(0, 2), IndexList[2](0, 2)](1)
```

Performance:

* Creates a view without copying data, making it very efficient.
* Maintains the original tensor's stride information for efficient memory access.
* Zero-cost abstraction at runtime when used with compile-time constant slices.

Notes:

* The slice is a view into the original tensor, so modifications to the slice will affect the original tensor.
* The slice indices must be ordered (e.g., \[0, 2] is valid, \[2, 0] is not).
* The step size must be 1 (no gaps allowed in the slice).
* Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior.

**Constraints:**

* Slice step size must be 1 (no gaps).
* Slice indices must be ordered (ascending).
* Tensor rank must be at least 2.

**Parameters:**

* d0\_slice (`Slice`): Slice specification for the first selected dimension.
* d1\_slice (`Slice`): Slice specification for the second selected dimension.
* slice\_indices (`IndexList[2]`): Indices of the two dimensions to slice (must be ordered).
* \_\_offset\_dims (`Int`): Internal parameter representing number of fixed dimensions.

**Args:**

* offsets (`IndexList[__offset_dims]`): Fixed index values for all dimensions not being sliced.

**Returns:**

A 2D view into the original tensor representing the specified slice.

### `slice_1d`

`slice_1d[d0_slice: Slice, slice_indices: IndexList[1], __offset_dims: Int = (layout.rank() + -1)](self, offsets: IndexList[__offset_dims]) -> LayoutTensor[dtype, _compute_slice_layout(d0_slice, slice_indices.__getitem__[::Indexer](0)), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]`

Extract a 1D slice from a higher-rank tensor at a specific index. This method creates a view into a 1D subset of a higher-rank tensor by:

1. Selecting one dimension to slice using the `slice_indices` parameter.
2. Applying a slice specification to that dimension.
3. Using fixed offsets for all other dimensions.
Example: For a 3×4×5 tensor, the following example extracts a 1D slice from dimension 0, with dimensions 1 and 2 fixed at indices 1 and 2:

```mojo
var s = t.slice_1d[Slice(1, 3), IndexList[1](0)](1, 2)
```

Performance:

* Creates a view without copying data, making it very efficient.
* Maintains the original tensor's stride information for efficient memory access.
* Zero-cost abstraction at runtime when used with compile-time constant slices.

Notes:

* The slice is a view into the original tensor, so modifications to the slice will affect the original tensor.
* The step size must be 1 (no gaps allowed in the slice).
* Slice bounds are not checked at runtime; accessing out-of-bounds indices will result in undefined behavior.
* This function exists as a workaround for compiler limitations with overloading.

**Constraints:**

* Slice step size must be 1 (no gaps).
* Tensor rank must be at least 1.

**Parameters:**

* d0\_slice (`Slice`): Slice specification for the selected dimension.
* slice\_indices (`IndexList[1]`): Index of the dimension to slice.
* \_\_offset\_dims (`Int`): Internal parameter representing number of fixed dimensions.

**Args:**

* offsets (`IndexList[__offset_dims]`): Fixed index values for all dimensions not being sliced.

**Returns:**

A 1D view into the original tensor representing the specified slice.

### `transpose`

`transpose[M: Int = shape[::Int](), N: Int = shape[::Int]()](self) -> LayoutTensor[dtype, composition(layout, __init__[::Origin[::Bool(__init__[::Origin[::Bool(IntTuple(N), IntTuple(M), Tuple()), __init__[::Origin[::Bool(IntTuple(M), IntTuple(1), Tuple()))), origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]`

Create a transposed view of a rank-2 tensor.

This method creates a view of the tensor with its dimensions swapped, effectively converting rows to columns and columns to rows. The transposition is performed without copying data, by adjusting the tensor's layout information.

Example: For a 2×3 tensor with values:

```
[1 2 3]
[4 5 6]
```

`transpose()` will produce a 3×2 tensor:

```
[1 4]
[2 5]
[3 6]
```

Performance:

* Creates a view without copying data, making it very efficient.
* The operation is zero-cost at runtime as it only changes the layout information.
* Memory access patterns may be less efficient in the transposed view due to non-contiguous memory access, especially for row-major storage.

Notes:

* The transposed tensor shares the same memory as the original tensor, so modifications to one will affect the other.
* Only works with rank-2 tensors.
* For optimal performance when repeatedly accessing the transposed data, consider creating a physical copy with the transposed layout.

**Constraints:**

* Only works with rank-2 tensors.

**Parameters:**

* M (`Int`): The size of the first dimension (rows) of the original tensor. Defaults to the static shape value of the first dimension.
* N (`Int`): The size of the second dimension (columns) of the original tensor. Defaults to the static shape value of the second dimension.

**Returns:**

A view of the tensor with dimensions transposed (rows become columns and vice versa).

### `reshape`

`reshape[dst_layout: Layout](self) -> LayoutTensor[dtype, dst_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`

Create a view of the tensor with a different shape.
This method creates a view of the tensor with a new shape, without changing the underlying data. The total number of elements must remain the same. Example: For a 2×6 tensor, `reshape[Layout((3, 4))]()` produces a 3×4 tensor with the same elements in row-major order. Performance: * Creates a view without copying data, making it very efficient. * The operation is zero-cost at runtime as it only changes the layout information. * Memory access patterns may change, potentially affecting performance depending on the original and target layouts. Notes: * The reshaped tensor shares the same memory as the original tensor, so modifications to one will affect the other. * The total number of elements must remain the same after reshaping. * The reshape operation assumes a row-major (C-style) memory layout. * For tensors with complex strides or non-contiguous memory, reshaping may not produce the expected results. * Masked tensors cannot be reshaped. **Constraints:** * Cannot reshape masked tensors. * The total number of elements must be the same in both layouts. **Parameters:** * ​dst\_layout (`Layout`): The target layout for the reshaped tensor. Must have the same total number of elements as the original tensor. **Returns:** A view of the tensor with the new shape specified by dst\_layout. ### `composition` `composition[rhs_layout: Layout, dst_layout: Layout = composition(layout, rhs_layout)](self) -> LayoutTensor[dtype, dst_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Create a view of the tensor with a composed layout. This method creates a view of the tensor with a new layout that is the composition of the original layout with another layout. Layout composition allows for complex transformations of the tensor's logical structure without copying data. Example: For a 4×4 tensor with a standard row-major layout, composing with a layout that represents a 2×2 tiling would result in a tensor that logically views the data as 2×2 blocks. Performance: * Creates a view without copying data, making it very efficient. * The operation is zero-cost at runtime as it only changes the layout information. * Can be used to optimize memory access patterns for specific algorithms. Notes: * The composed tensor shares the same memory as the original tensor, so modifications to one will affect the other. * Layout composition is a powerful tool for expressing complex data transformations like tiling, transposition, and reshaping in a unified framework. * Understanding the mathematical properties of layout composition is important for correctly using this function. **Constraints:** * The layouts must be compatible for composition. * The total number of elements must remain the same after composition. **Parameters:** * ​rhs\_layout (`Layout`): The layout to compose with the tensor's current layout. * ​dst\_layout (`Layout`): The resulting layout after composition. Defaults to the composition of the tensor's layout with rhs\_layout. **Returns:** A view of the tensor with the composed layout. ### `distance` `distance(self, addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space]) -> SIMD[linear_idx_type, 1]` Calculate the element-wise distance between this tensor's pointer and another pointer. This method computes the number of elements (not bytes) between the tensor's pointer and the provided address. This is useful for determining offsets within a larger memory allocation or for pointer arithmetic operations. 
Example: If `tensor.ptr` points to the element at index 100 in a buffer, and `addr` points to the element at index 50, then `distance(addr)` returns 50.

Performance:

* This is a lightweight operation that only involves pointer arithmetic.
* The operation is optimized based on the address space, using smaller integer types for shared memory to improve efficiency.

Notes:

* The distance is calculated in elements, not bytes.
* The result can be positive or negative depending on the relative positions of the pointers.
* This function is particularly useful for GPU programming where understanding memory offsets is critical for performance.
* Care should be taken when using this with pointers from different allocations, as the result would be meaningless.

**Args:**

* addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): The target pointer to calculate the distance to.

**Returns:**

The number of elements between this tensor's pointer and the provided address. The result is of type `linear_idx_type`.

`distance[_layout: Layout, _uint_dtype: DType = _get_unsigned_type(_layout, address_space)](self, src: LayoutTensor[dtype, _layout, origin, address_space=address_space]) -> SIMD[_uint_dtype, 1]`

Calculate the element-wise distance between this tensor and another tensor.

This method computes the number of elements (not bytes) between this tensor's pointer and another tensor's pointer. This is useful for determining the relative positions of tensors within a larger memory allocation.

Example: If `tensor1` points to the element at index 100 in a buffer, and `tensor2` points to the element at index 50, then `tensor1.distance(tensor2)` would return 50.

Performance:

* This is a lightweight operation that only involves pointer arithmetic.
* The operation is optimized based on the address space and layout, using appropriate integer types for efficiency.

Notes:

* The distance is calculated in elements, not bytes.
* The result can be positive or negative depending on the relative positions of the tensors.
* This function is particularly useful for GPU programming where understanding memory offsets is critical for performance.
* Both tensors must be in the same address space for the result to be meaningful.
* This overload is more type-safe than the pointer-based version as it ensures the tensors have compatible data types and address spaces.

**Parameters:**

* \_layout (`Layout`): The layout of the source tensor.
* \_uint\_dtype (`DType`): The unsigned integer type to use for the result. Automatically determined based on the layout and address space.

**Args:**

* src (`LayoutTensor[dtype, _layout, origin, address_space=address_space]`): The source tensor to calculate the distance to.

**Returns:**

The number of elements between this tensor's pointer and the source tensor's pointer. The result is of type `_uint_dtype`.

### `copy_from`

`copy_from(self, other: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Copy data from another tensor to this tensor.

This method performs an element-by-element copy from the source tensor to this tensor, respecting the layouts of both tensors. The copy operation handles different memory layouts correctly, ensuring that elements are copied to their proper positions regardless of how the data is arranged in memory.

**Constraints:**

* Both tensors must have statically known shapes.
* The total number of elements must be the same in both tensors.
* The element sizes must match between the tensors.

Example (a minimal sketch; the tensors are stack-allocated here for brevity):

```mojo
from layout import Layout, LayoutTensor

var src = LayoutTensor[
    DType.float32, Layout.row_major(2, 3), MutableAnyOrigin
].stack_allocation()
var dst = LayoutTensor[
    DType.float32, Layout.row_major(3, 2), MutableAnyOrigin
].stack_allocation()
dst.copy_from(src)  # Copies all elements from src to dst
```

Performance:

* Performs element-by-element copying, which may be less efficient than vectorized or bulk memory operations.
* The copy respects the memory layout of both tensors, which may involve non-contiguous memory access patterns.
* For optimal performance with large tensors, consider using specialized copy functions that can leverage hardware acceleration.

Notes:

* Both tensors must have statically known shapes.
* The total number of elements must be the same in both tensors.
* The element sizes must match between the tensors.
* This function handles different memory layouts correctly, making it suitable for copying between tensors with different shapes or strides.
* The copy is performed element by element, not as a bulk memory copy.

**Args:**

* other (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor to copy data from. Must have the same total number of elements as this tensor.

### `copy_from_async`

`copy_from_async[is_masked: Bool = False, swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1}), fill: Fill = Fill(0), eviction_policy: CacheEviction = CacheEviction(0)](self, src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], src_idx_bound: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0), base_offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0))`

Asynchronously copy data from another tensor to this tensor using GPU hardware.

This method performs an asynchronous copy from the source tensor to this tensor using GPU hardware acceleration. It's specifically designed for copying data from global memory to shared memory in GPU kernels, leveraging hardware-specific asynchronous copy mechanisms for improved performance.

Example (a schematic sketch of use inside a GPU kernel, where `global_data` is assumed to be a 128×128 tensor in global memory, such as a kernel argument):

```mojo
from layout import Layout, LayoutTensor, AddressSpace

var shared_data = LayoutTensor[
    DType.float32,
    Layout.row_major(32, 32),
    MutableAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
# Asynchronously stage one 32x32 tile of the global tensor into shared
# memory; a synchronization barrier is required before reading it.
shared_data.copy_from_async(global_data.tile[32, 32](0, 0))
```

Performance:

* Uses hardware-accelerated asynchronous copy mechanisms for optimal performance.
* Particularly efficient for copying data from global memory to shared memory in GPU kernels.
* Supports vectorized copies for 4, 8, or 16-byte elements for better throughput.
* Can bypass L1 cache with appropriate eviction policies for specific access patterns.
* Swizzling can improve memory access patterns and reduce bank conflicts.

Notes:

* For vectorized copies, both tensors must have contiguous element layouts.
* Asynchronous copies allow computation to overlap with memory transfers.
* A synchronization barrier is required before using the copied data.

**Constraints:**

* Destination must be in shared memory.
* Source and destination data types must match.
* Element size must be 4, 8, or 16 bytes.
* Destination tensor must have a static layout.
**Parameters:**

* is\_masked (`Bool`): Whether to perform a masked copy, where elements outside the `src_idx_bound` are not copied or filled with zeros.
* swizzle (`OptionalReg[Swizzle]`): Optional swizzling function to rearrange the destination indices, which can improve memory access patterns.
* fill (`Fill`): Fill policy for elements that are not copied (only used with masked copies).
* eviction\_policy (`CacheEviction`): Cache eviction policy for the source data.

**Args:**

* src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor to copy data from.
* src\_idx\_bound (`SIMD[linear_idx_type, 1]`): For masked copies, the upper bound index for valid source elements.
* base\_offset (`SIMD[linear_idx_type, 1]`): Base offset for swizzling calculations.

### `fill`

`fill(self: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], val: SIMD[dtype, 1]) -> LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`

Fill the entire tensor with a single value.

This method sets all elements of the tensor to the specified value. It works with both statically and dynamically shaped tensors, filling all elements regardless of the tensor's layout.

Example (a minimal sketch using a stack-allocated tensor):

```mojo
from layout import Layout, LayoutTensor

var tensor = LayoutTensor[
    DType.float32, Layout.row_major(3, 4), MutableAnyOrigin
].stack_allocation()
_ = tensor.fill(0.0)  # Sets all elements to 0.0
```

Performance:

* For statically known layouts, the fill operation is unrolled at compile time.
* For dynamic layouts, a runtime loop is used.
* No vectorization is applied, so performance may be suboptimal for large tensors.
* Consider using hardware-specific fill operations for better performance with large tensors.

Notes:

* The tensor must be mutable (`mut=True`).
* The fill operation respects the tensor's layout, filling all elements regardless of how they are arranged in memory.
* This method can be used with tensors of any rank and shape.
* For tensors with `element_layout`, all elements within each logical element are filled with the same value.

**Args:**

* val (`SIMD[dtype, 1]`): The value to fill the tensor with. Must be of the same data type as the tensor.

**Returns:**

The tensor itself (self), allowing for method chaining.

### `__str__`

`__str__(self) -> String`

Convert the tensor to a string representation.

This method converts the tensor to a human-readable string representation by writing its contents to a string. It delegates to the `write_to` method, which formats the tensor appropriately based on its rank and shape.

**Returns:**

A string representation of the tensor.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Format and write the tensor's contents to a writer.

This method formats the tensor's contents and writes them to the provided writer. For 2D tensors, it formats the output in a 2D grid. For tensors of other ranks, it prints all values in column-major coordinate order.
Example (a minimal sketch using a stack-allocated tensor):

```mojo
from layout import Layout, LayoutTensor

var tensor = LayoutTensor[
    DType.float32, Layout.row_major(2, 3), MutableAnyOrigin
].stack_allocation()
_ = tensor.fill(1.0)
print(tensor)  # Internally calls `write_to` with a StringWriter
```

Output for a 2×3 tensor:

```
[[1.0, 1.0, 1.0],
[1.0, 1.0, 1.0]]
```

Notes:

* For 2D tensors, the output is formatted as a 2D grid with rows and columns.
* For tensors of other ranks, values are printed in column-major coordinate order.
* Empty tensors (size 0) produce no output.
* This method is used by the `__str__` method to convert the tensor to a string.
* The formatting is designed for human readability rather than parsing.
* For large tensors, the output may be truncated to avoid excessive output.

**Parameters:**

* W (`Writer`): The writer type that will receive the formatted output.

**Args:**

* writer (`W`): The writer instance to write the formatted output to.

---

## LayoutTensorBuild

`@register_passable(trivial)`

`struct LayoutTensorBuild[dtype: DType, *, __layout: Layout = __init__[::Origin[::Bool(IntTuple(1)), __layout_init: Bool = False, __address_space: AddressSpace = AddressSpace(0), __layout_int_type: DType = _get_layout_type(__layout, __address_space), __index_type: DType = _get_index_type(__layout, __address_space), __circular: Bool = False]`

Tensor layout builder providing a fluent interface for constructing tensors with various layouts.

## Parameters

* dtype (`DType`): Data type of tensor elements.
* \_\_layout (`Layout`): The tensor's memory layout.
* \_\_layout\_init (`Bool`): Whether the layout has been initialized.
* \_\_address\_space (`AddressSpace`): Memory space (generic, shared, local).
* \_\_layout\_int\_type (`DType`): Layout index type.
* \_\_index\_type (`DType`): Type used for indexing.
* \_\_circular (`Bool`): Whether tensor has circular indexing semantics.

## Fields

* runtime\_layout (`RuntimeLayout[__layout, element_type=__layout_int_type, linear_idx_type=__index_type]`): Runtime representation of the tensor's layout. This field stores the layout information that can be manipulated at runtime, particularly important for tensors with dynamic dimensions. It encapsulates:
  * The static layout template from the `__layout` parameter
  * The bit width for index calculations
  * The appropriate index type based on address space

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__() -> Self`

Initializes a new `LayoutTensorBuild` instance with default values.
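The methods below are designed to chain: choose a layout (`row_major`, `col_major`, or `layout`), optionally choose an address space (`shared`, `local`) or circular indexing, then materialize the tensor with `alloc()`, `view()`, or `iter()`. A minimal sketch of that flow, assuming the builder is importable from `layout.tensor_builder`:

```mojo
from layout.tensor_builder import LayoutTensorBuild

fn main():
    # Fluent chain: dtype -> compile-time 4x4 row-major layout -> allocate.
    var t = LayoutTensorBuild[DType.float32]().row_major[4, 4]().alloc()
    _ = t.fill(0.0)
```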
### `row_major`

`row_major[*shapes: Int](self) -> LayoutTensorBuild[dtype, __layout=row_major[::Origin[::Bool(_to_int_tuple[::VariadicList[::Int]]()), __layout_init=True]`

Creates a row-major layout using compile-time dimensions.

**Parameters:**

* \*shapes (`Int`): Variadic parameter specifying the dimensions of the tensor. Each value represents the size of a dimension.

**Returns:**

`LayoutTensorBuild` - A new builder with row-major layout.

`row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim), __layout_init=True]`

Creates a row-major 2D layout using runtime dimensions.

**Args:**

* shape0 (`ValueOrUnknown[dim]`): First dimension size.
* shape1 (`ValueOrUnknown[dim]`): Second dimension size.

**Returns:**

`LayoutTensorBuild` - A new builder with row-major layout.

`row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim), __layout_init=True]`

Creates a row-major 3D layout using runtime dimensions.

**Args:**

* shape0 (`ValueOrUnknown[dim]`): First dimension size.
* shape1 (`ValueOrUnknown[dim]`): Second dimension size.
* shape2 (`ValueOrUnknown[dim]`): Third dimension size.

**Returns:**

`LayoutTensorBuild` - A new builder with row-major layout.

`row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim, dim), __layout_init=True]`

Creates a row-major 4D layout using runtime dimensions.

**Args:**

* shape0 (`ValueOrUnknown[dim]`): First dimension size.
* shape1 (`ValueOrUnknown[dim]`): Second dimension size.
* shape2 (`ValueOrUnknown[dim]`): Third dimension size.
* shape3 (`ValueOrUnknown[dim]`): Fourth dimension size.

**Returns:**

`LayoutTensorBuild` - A new builder with row-major layout.

`row_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim], shape4: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=row_major(dim, dim, dim, dim, dim), __layout_init=True]`

Creates a row-major 5D layout using runtime dimensions.

**Args:**

* shape0 (`ValueOrUnknown[dim]`): First dimension size.
* shape1 (`ValueOrUnknown[dim]`): Second dimension size.
* shape2 (`ValueOrUnknown[dim]`): Third dimension size.
* shape3 (`ValueOrUnknown[dim]`): Fourth dimension size.
* shape4 (`ValueOrUnknown[dim]`): Fifth dimension size.

**Returns:**

`LayoutTensorBuild` - A new builder with row-major layout.

### `col_major`

`col_major[*shapes: Int](self) -> LayoutTensorBuild[dtype, __layout=col_major[::Origin[::Bool(_to_int_tuple[::VariadicList[::Int]]()), __layout_init=True]`

Creates a column-major layout using compile-time dimensions.

**Parameters:**

* \*shapes (`Int`): Variadic parameter specifying the dimensions of the tensor. Each value represents the size of a dimension.

**Returns:**

`LayoutTensorBuild` - A new builder with column-major layout.

`col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim), __layout_init=True]`

Creates a column-major 2D layout using runtime dimensions.

**Args:**

* shape0 (`ValueOrUnknown[dim]`): First dimension size.
* shape1 (`ValueOrUnknown[dim]`): Second dimension size.

**Returns:**

`LayoutTensorBuild` - A new builder with column-major layout.

`col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim), __layout_init=True]`

Creates a column-major 3D layout using runtime dimensions.

**Args:**

* shape0 (`ValueOrUnknown[dim]`): First dimension size.
* shape1 (`ValueOrUnknown[dim]`): Second dimension size.
* shape2 (`ValueOrUnknown[dim]`): Third dimension size.

**Returns:**

`LayoutTensorBuild` - A new builder with column-major layout.

`col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim, dim), __layout_init=True]`

Creates a column-major 4D layout using runtime dimensions.

**Args:**

* shape0 (`ValueOrUnknown[dim]`): First dimension size.
* shape1 (`ValueOrUnknown[dim]`): Second dimension size.
* ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. * ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. `col_major(self, shape0: ValueOrUnknown[dim], shape1: ValueOrUnknown[dim], shape2: ValueOrUnknown[dim], shape3: ValueOrUnknown[dim], shape4: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=col_major(dim, dim, dim, dim, dim), __layout_init=True]` Creates a column-major 5D layout using runtime dimensions. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): First dimension size. * ​shape1 (`ValueOrUnknown[dim]`): Second dimension size. * ​shape2 (`ValueOrUnknown[dim]`): Third dimension size. * ​shape3 (`ValueOrUnknown[dim]`): Fourth dimension size. * ​shape4 (`ValueOrUnknown[dim]`): Fifth dimension size. **Returns:** `LayoutTensorBuild` - A new builder with column-major layout. ### `layout` `layout[shape0: Int](self) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(IntTuple(shape0)), __layout_init=True]` Creates a 1D layout with a compile-time dimension. **Parameters:** * ​shape0 (`Int`): Size of the single dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified layout. `layout[rank: Int, shape: IndexList[rank], stride: IndexList[rank]](self) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(_to_int_tuple[::Int](shape), _to_int_tuple[::Int](stride)), __layout_init=True]` Creates a custom layout with compile-time dimensions and strides. **Parameters:** * ​rank (`Int`): Number of dimensions. * ​shape (`IndexList[rank]`): List of dimension sizes. * ​stride (`IndexList[rank]`): List of strides for each dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified custom layout. `layout[rank: Int](self, shape: IndexList[rank], stride: IndexList[rank]) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(_to_int_tuple[::Int](-1), _to_int_tuple[::Int](-1)), __layout_init=True]` Creates a custom layout with runtime dimensions and strides. **Parameters:** * ​rank (`Int`): Number of dimensions. **Args:** * ​shape (`IndexList[rank]`): List of dimension sizes. * ​stride (`IndexList[rank]`): List of strides for each dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified custom layout. `layout(self, shape0: ValueOrUnknown[dim]) -> LayoutTensorBuild[dtype, __layout=__init__[::Origin[::Bool(IntTuple(dim)), __layout_init=True]` Creates a 1D layout with a runtime dimension. **Args:** * ​shape0 (`ValueOrUnknown[dim]`): Size of the single dimension. **Returns:** `LayoutTensorBuild` - A new builder with the specified layout. ### `shared` `shared(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=AddressSpace(3)]` Places the tensor in GPU shared memory. **Returns:** `LayoutTensorBuild` - A new builder with shared memory address space. ### `local` `local(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=AddressSpace(5)]` Places the tensor in GPU local memory. **Returns:** `LayoutTensorBuild` - A new builder with local memory address space. ### `alloc` `alloc(self) -> LayoutTensor[dtype, __layout, MutableAnyOrigin, address_space=__address_space]` Allocates a new tensor using the current layout. Note: Fails to compile if layout is not set, dimensions are not known, or tensor is circular. 
**Returns:** `LayoutTensor` - A newly allocated tensor with the specified layout ### `view` `view[address_space: AddressSpace](self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space]) -> LayoutTensor[dtype, __layout, MutableAnyOrigin, address_space=address_space, layout_int_type=__layout_int_type, linear_idx_type=__index_type]` Creates a tensor view over existing memory. Note: Fails to compile if layout is not set, address spaces don't match, or tensor is circular. **Parameters:** * ​address\_space (`AddressSpace`): Memory address space for the tensor (generic, shared, local). **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): Pointer to memory region to create the view over. **Returns:** `LayoutTensor` - A tensor view over the specified memory region with the current layout. ### `circular` `circular(self) -> LayoutTensorBuild[dtype, __layout=__layout, __layout_init=__layout_init, __address_space=__address_space, __circular=True]` Enables circular indexing for the tensor. **Returns:** `LayoutTensorBuild` - A new builder with circular indexing enabled. ### `iter` `iter(self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=__address_space], bound: Int) -> LayoutTensorIter[dtype, __layout, MutableAnyOrigin, address_space=__address_space, circular=__circular, layout_int_type=__layout_int_type, linear_idx_type=__index_type]` Creates an iterator over tensor elements. Note: Fails to compile if layout is not set or dimensions are not known. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=__address_space]`): Pointer to memory region. * ​bound (`Int`): Upper bound for iteration. **Returns:** `LayoutTensorIter` - An iterator over tensor elements. --- ## LayoutTensorIter `@register_passable(trivial)` `struct LayoutTensorIter[mut: Bool, //, type: DType, layout: Layout, origin: Origin[mut], /, *, address_space: AddressSpace = AddressSpace(0), alignment: Int = alignof[::DType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1, circular: Bool = False, axis: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), layout_int_type: DType = _get_index_type(address_space), linear_idx_type: DType = _get_index_type(address_space), masked: Bool = False]` Iterator for traversing a memory buffer with a specific layout. `LayoutTensorIter` provides a way to iterate through memory according to a specific layout pattern, constructing layout tensors at each position. This enables efficient traversal of multi-dimensional data structures with custom memory layouts. Notes: The returned layout tensor is NOT vectorized. Users should explicitly vectorize if needed for performance-critical operations. ## Parameters * ​mut (`Bool`): Whether the iterator allows mutation of the underlying data. * ​type (`DType`): The data type of the tensor elements. * ​layout (`Layout`): The memory layout pattern to follow during iteration. * ​origin (`Origin[mut]`): Origin tracking for memory safety. * ​address\_space (`AddressSpace`): The memory address space (`GLOBAL`, `SHARED`, etc.). * ​alignment (`Int`): Memory alignment requirement for the data. * ​circular (`Bool`): Whether iteration wraps around at boundaries. * ​axis (`OptionalReg[Int]`): Optional axis for dimension-specific operations. * ​layout\_int\_type (`DType`): Integer type used for layout indices. * ​linear\_idx\_type (`DType`): Integer type used for indexing into memory. * ​masked (`Bool`): Whether to apply bounds masking during iteration. 
## Fields * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory region being iterated, with appropriate type and memory attributes. * ​offset (`SIMD[linear_idx_type, 1]`): Current offset from the base pointer, representing the iterator's position in memory. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements or blocks in memory during iteration. * ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region, limiting the iteration range. * ​runtime\_layout (`RuntimeLayout[layout, element_type=layout_int_type, linear_idx_type=linear_idx_type]`): Runtime representation of the layout pattern used for mapping logical indices to memory locations. * ​dimension\_bound (`SIMD[layout_int_type, 1]`): Boundary value for the current dimension when iterating along a specific axis. * ​idx (`SIMD[linear_idx_type, 1]`): Current logical index position within the iteration sequence. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `layout_uint_type` `alias layout_uint_type = SIMD[layout_int_type, 1]` The unsigned integer type used for layout, based on layout and address space. ### `linear_uint_type` `alias linear_uint_type = SIMD[linear_idx_type, 1]` The unsigned integer type used for indexing into memory. ## Methods ### `__init__` `__init__() -> Self` Initialize an empty iterator. Creates a default iterator with zero values, typically used as a placeholder or default value. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], bound: SIMD[linear_idx_type, 1], stride: SIMD[linear_idx_type, 1] = SIMD(layout.size()), offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initialize an iterator with a pointer and basic parameters. Creates an iterator for a memory region with the specified bounds and stride. **Constraints:** The layout must have all dimensions known at compile time. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the beginning of the memory region. * ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements (defaults to layout size). * ​offset (`SIMD[linear_idx_type, 1]`): Initial offset from the base pointer. `__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], bound: SIMD[linear_idx_type, 1], runtime_layout: RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type], stride: SIMD[linear_idx_type, 1] = SIMD(layout.size() if layout.all_dims_known() else -1), offset: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0), dimension_bound: SIMD[layout_int_type, 1] = __init__[__mlir_type.!pop.int_literal](0), idx: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> Self` Initialize an iterator with a runtime layout. Creates an iterator with a runtime-determined layout, allowing for more flexible memory traversal patterns. **Constraints:** The runtime layout must have the same bitwidth as specified for the iterator. Circular iteration is not supported when an axis is defined. 
**Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the beginning of the memory region. * ​bound (`SIMD[linear_idx_type, 1]`): Upper bound of the memory region. * ​runtime\_layout (`RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]`): Layout determined at runtime. * ​stride (`SIMD[linear_idx_type, 1]`): Step size between consecutive elements. * ​offset (`SIMD[linear_idx_type, 1]`): Initial offset from the base pointer. * ​dimension\_bound (`SIMD[layout_int_type, 1]`): Bound for the specified dimension when using masked iteration. * ​idx (`SIMD[linear_idx_type, 1]`): Initial index position. ### `__getitem__` `__getitem__(self) -> LayoutTensor[type, layout, origin, address_space=address_space, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Get the layout tensor at the current iterator position. Operator overload that returns a layout tensor representing the data at the current position of the iterator. **Returns:** A layout tensor at the current iterator position. ### `__iadd__` `__iadd__[T: Intable](mut self, rhs: T)` Increment the iterator by an integer value. Advances the iterator by the specified number of positions. Notes: This function is unsafe. It omits bound checking for performance reasons. Caller must ensure the index doesn't go out-of-bounds. **Parameters:** * ​T (`Intable`): A type that can be converted to an integer. **Args:** * ​rhs (`T`): The number of positions to advance. `__iadd__(mut self, rhs: SIMD[linear_idx_type, 1])` Increment the iterator by a uint value. Advances the iterator by the specified number of positions. Notes: This function is unsafe. It omits bound checking for performance reasons. Caller must ensure the index doesn't go out-of-bounds. **Args:** * ​rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance. ### `get` `get(self) -> LayoutTensor[type, layout, origin, address_space=address_space, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Get the layout tensor at the current iterator position. Returns a layout tensor representing the data at the current position of the iterator. **Returns:** A tensor view at the current iterator position with the same type, layout, and memory characteristics as specified by the output parameter. ### `next` `next[T: Intable](self, rhs: T) -> Self` Return an iterator pointing to a position ahead by rhs steps. Creates a new iterator that points rhs positions ahead of the current one. **Parameters:** * ​T (`Intable`): An integer-convertible type for the step size. **Args:** * ​rhs (`T`): The number of positions to advance. **Returns:** A new iterator pointing to the advanced position. `next(self, rhs: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> Self` Return an iterator pointing to a position ahead by rhs steps. Creates a new iterator that points rhs positions ahead of the current one. **Args:** * ​rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance (defaults to 1). **Returns:** A new iterator pointing to the advanced position. ### `next_unsafe` `next_unsafe(self, rhs: SIMD[linear_idx_type, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> Self` Return an iterator pointing to a position ahead by rhs steps (unsafe version). Creates a new iterator that points rhs positions ahead of the current one. 
This is an unsafe version that omits certain checks for performance.

**Constraints:**

Cannot be used with masked iterators. The caller must ensure that the advanced position stays within the iteration bound.

**Args:**

* ​rhs (`SIMD[linear_idx_type, 1]`): The number of positions to advance (defaults to 1).

**Returns:**

A new iterator pointing to the advanced position.

### `reshape`

`reshape[dst_layout: Layout](self) -> LayoutTensorIter[type, dst_layout, origin, address_space=address_space, alignment=alignment, circular=circular, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`

Reshape the iterator to a new layout.

This method creates a new iterator with a different layout while preserving the underlying data. The new layout must have the same total size as the original.

**Constraints:**

* The destination layout must have the same total size as the original.
* Both layouts must be contiguous.
* Both layouts must have compile-time known dimensions.

**Parameters:**

* ​dst\_layout (`Layout`): The target layout to reshape to.

**Returns:**

A new iterator with the specified layout.

### `bitcast`

`bitcast[new_type: DType, *, address_space: AddressSpace = address_space, alignment: Int = alignment](self) -> LayoutTensorIter[new_type, layout, origin, address_space=address_space, alignment=alignment, circular=circular, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked]`

Reinterpret the iterator's underlying pointer as a different data type.

This method performs a bitcast operation, allowing you to view the same memory location as a different data type without copying or converting the data.

**Parameters:**

* ​new\_type (`DType`): The target data type to cast to.
* ​address\_space (`AddressSpace`): The memory address space for the new iterator (defaults to current).
* ​alignment (`Int`): Memory alignment requirement for the new iterator (defaults to current).

**Returns:**

A new `LayoutTensorIter` with the same layout but a different data type.

---

## LayoutTrait

Defines the interface for mapping between logical coordinates and memory indices.

The `LayoutTrait` provides a common interface for all layout types, including basic layouts, swizzles, and composed layouts. It enables mapping from multi-dimensional logical coordinates to linear memory indices, which is essential for tensor operations.

Implementations of this trait must provide methods for:

1. Mapping coordinates to indices via the `__call__` method.
2. Calculating the total size of the layout's domain.
3. Calculating the size of the layout's codomain (memory footprint).
4. Indicating whether the layout has a valid shape.

This trait serves as the foundation for the layout system, allowing different layout implementations to be used interchangeably in algorithms.

## Implemented traits

`AnyType`, `Copyable`, `UnknownDestructibility`

## Aliases

### `has_shape`

`alias has_shape`

Indicates whether the layout has a valid shape. Layouts and `ComposedLayout`s with at least one `Layout` have valid shapes and can be used in layout algebra. Swizzles don't have shapes and should be excluded from layout algebra.

## Methods

### `__copyinit__`

`__copyinit__(out self: _Self, existing: _Self, /)`

Create a new instance of the value by copying an existing one.

**Args:**

* ​existing (`_Self`): The value to copy.

### `__call__`

`__call__(self: _Self, index: IntTuple[origin]) -> Int`

Maps a logical coordinate to a linear memory index.

**Args:**

* ​index (`IntTuple[origin]`): An `IntTuple` representing the logical coordinates to map.
**Returns:**

The linear memory index corresponding to the given coordinates.

### `size`

`size(self: _Self) -> Int`

Returns the total number of elements in the layout's domain. For a layout with shape `(m, n)`, this returns `m * n`, representing the total number of valid coordinates in the layout.

**Returns:**

The total number of elements in the layout.

### `cosize`

`cosize(self: _Self) -> Int`

Returns the size of the memory region spanned by the layout. For a layout with shape `(m, n)` and stride `(r, s)`, this returns `(m-1)*r + (n-1)*s + 1`, representing the memory footprint.

**Returns:**

The size of the memory region required by the layout.

---

## lcm

`lcm(m: Int, n: Int, /) -> Int`

Computes the least common multiple of two integers.

**Args:**

* ​m (`Int`): The first integer.
* ​n (`Int`): The second integer.

**Returns:**

The least common multiple of the two integers.

`lcm(s: Span[Int, origin], /) -> Int`

Computes the least common multiple of a span of integers.

**Args:**

* ​s (`Span[Int, origin]`): A span of integers.

**Returns:**

The least common multiple of the span.

`lcm(l: List[Int, hint_trivial_type], /) -> Int`

Computes the least common multiple of a list of integers.

**Args:**

* ​l (`List[Int, hint_trivial_type]`): A list of integers.

**Returns:**

The least common multiple of the list.

`lcm(*values: Int) -> Int`

Computes the least common multiple of a variadic list of integers.

**Args:**

* ​\*values (`Int`): A variadic list of integers.

**Returns:**

The least common multiple of the list.

---

## ld_matrix

`ld_matrix[type: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, simd_width]`

Loads a matrix from shared memory into registers in a format suitable for tensor core operations.

This function performs a warp-synchronized load from shared memory to registers, formatting the data to be directly usable by tensor core Matrix Multiply-Accumulate (MMA) instructions.

Note:

* All threads in a warp must execute this operation together.
* For transposed loads, only half precision (float16) is supported.
* The register width is fixed at 4 bytes (32 bits).
* Supported configurations:
  * x1: One 32-bit register per thread.
  * x2: Two 32-bit registers per thread.
  * x4: Four 32-bit registers per thread.

Example:

```mojo
from gpu.mma import ld_matrix

# Load 8x8 matrix of float16 values
var data = ld_matrix[DType.float16, 8](ptr)

# Load transposed matrix
var transposed = ld_matrix[DType.float16, 8, transpose=True](ptr)
```

**Parameters:**

* ​type (`DType`): The data type of the matrix elements (e.g. float16, float32).
* ​simd\_width (`Int`): The width of the SIMD vector to load.
* ​transpose (`Bool`): Whether to transpose the matrix during load (only supported for half precision).

**Args:**

* ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory containing the source matrix data.

**Returns:**

SIMD vector containing the loaded matrix data, properly formatted for MMA operations.

---

## ldexp

`ldexp[dtype: DType, width: Int, //](x: SIMD[dtype, width], exp: SIMD[int32, width]) -> SIMD[dtype, width]`

Computes the elementwise ldexp function. The ldexp function multiplies a floating point value x by the number 2 raised to the exp power. That is, $ldexp(x, exp)$ calculates the value of $x * 2^{exp}$ and is used within the $erf$ function.
**Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector of floating point values. * ​exp (`SIMD[int32, width]`): SIMD vector containing the exponents. **Returns:** Vector containing elementwise result of ldexp on x and exp. --- ## ldg `ldg[type: DType, //, width: Int = 1, *, alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]()](x: UnsafePointer[SIMD[type, 1]]) -> SIMD[type, width]` Load data from global memory through the non-coherent cache. This function provides a hardware-accelerated global memory load operation that uses the GPU's non-coherent cache (equivalent to CUDA's `__ldg` instruction). It optimizes for read-only data access patterns. Note: * Uses invariant loads which indicate the memory won't change during kernel execution. * Particularly beneficial for read-only texture-like access patterns. * May improve performance on memory-bound kernels. **Parameters:** * ​type (`DType`): The data type to load (must be numeric). * ​width (`Int`): The SIMD vector width for vectorized loads. * ​alignment (`Int`): Memory alignment in bytes. Defaults to natural alignment of the SIMD vector type. **Args:** * ​x (`UnsafePointer[SIMD[type, 1]]`): Pointer to global memory location to load from. **Returns:** SIMD vector containing the loaded data. --- ## ldx `ldx(gpr: Int)` --- ## ldy `ldy(gpr: Int)` --- ## ldz `ldz(gpr: Int)` --- ## ldzi `ldzi(gpr: Int)` --- ## len Provides the `len()` function and its associated traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Sized`](/mojo/stdlib/builtin/len/Sized): The `Sized` trait describes a type that has an integer length (such as a string or array). * [​`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising): The `SizedRaising` trait describes a type that has an integer length, which might raise an error if the length can't be determined. * [​`UIntSized`](/mojo/stdlib/builtin/len/UIntSized): The `Sized` trait describes a type that has an integer length (such as a string or array). ## Functions * [​`len`](/mojo/stdlib/builtin/len/len): Get the length of a value. --- ## len `len[T: Sized](value: T) -> Int` Get the length of a value. **Parameters:** * ​T (`Sized`): The Sized type. **Args:** * ​value (`T`): The object to get the length of. **Returns:** The length of the object. `len[T: SizedRaising](value: T) -> Int` Get the length of a value. **Parameters:** * ​T (`SizedRaising`): The Sized type. **Args:** * ​value (`T`): The object to get the length of. **Returns:** The length of the object. **Raises:** If the length cannot be computed. --- ## LessThanComparable A type which can be less than compared with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__lt__` `__lt__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than `rhs`. --- ## LessThanOrEqualComparable A type which can be less than or equal to compared with other instances of itself. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__le__` `__le__(self: _Self, rhs: _Self) -> Bool` Define whether `self` is less than or equal to `rhs`. **Args:** * ​rhs (`_Self`): The right hand side of the comparison. **Returns:** True if `self` is less than or equal to `rhs`. 
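As an illustration of these two traits, here's a minimal sketch of a struct that conforms to both (the `Meters` type is hypothetical):

```mojo
@value
struct Meters(LessThanComparable, LessThanOrEqualComparable):
    var value: Float64

    fn __lt__(self, rhs: Self) -> Bool:
        return self.value < rhs.value

    fn __le__(self, rhs: Self) -> Bool:
        return self.value <= rhs.value

def main():
    var a = Meters(1.5)
    var b = Meters(2.0)
    print(a < b)   # True: dispatches to __lt__
    print(b <= a)  # False: dispatches to __le__
```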
--- ## Level `struct Level` Represents logging severity levels. Defines the available logging levels in ascending order of severity. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `CRITICAL` `alias CRITICAL = Level(50)` A serious error indicating that the program itself may be unable to continue running. ### `DEBUG` `alias DEBUG = Level(10)` Detailed information, typically of interest only when diagnosing problems. ### `ERROR` `alias ERROR = Level(40)` Due to a more serious problem, the software has not been able to perform some function. ### `INFO` `alias INFO = Level(20)` Confirmation that things are working as expected. ### `NOTSET` `alias NOTSET = Level(0)` Lowest level, used when no level is set. ### `WARNING` `alias WARNING = Level(30)` Indication that something unexpected happened, or may happen in the near future. ## Methods ### `__lt__` `__lt__(self, other: Self) -> Bool` Returns True if this level is less than the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is less than the other level, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Returns True if this level is less than or equal to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is less than or equal to the other level, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Returns True if this level equals the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if the levels are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Returns True if this level does not equal the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if the levels are not equal, False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Returns True if this level is greater than the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is greater than the other level, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Returns True if this level is greater than or equal to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is greater than or equal to the other level, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Returns True if this level is identical to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is identical to the other level, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Returns True if this level is not identical to the other level. **Args:** * ​other (`Self`): The level to compare with. **Returns:** Bool: True if this level is not identical to the other level, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the string representation of this level to a writer. **Parameters:** * ​W (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Returns the string representation of this level. **Returns:** String: A human-readable string representation of the level (e.g., "DEBUG", "INFO"). ### `__repr__` `__repr__(self) -> String` Returns the detailed string representation of this level. 
**Returns:** String: A string representation including the type name and level value (e.g., "Level.DEBUG"). --- ## lexists `lexists[PathLike: PathLike, //](path: PathLike) -> Bool` Return True if path exists or is a broken symlink. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** Returns True if the path exists or is a broken symbolic link. --- ## lgamma `lgamma[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `lgamma` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `lgamma` of the input. --- ## Life of a value The life of a value in Mojo begins when a variable is initialized and continues up until the value is last used, at which point Mojo destroys it. This page describes how every value in Mojo is created, copied, and moved. (The next page describes [how values are destroyed](/mojo/manual/lifecycle/death).) All data types in Mojo—including basic types in the standard library such as [`Bool`](/mojo/stdlib/builtin/bool/Bool), [`Int`](/mojo/stdlib/builtin/int/Int), and [`String`](/mojo/stdlib/collections/string/string/String), up to complex types such as [`SIMD`](/mojo/stdlib/builtin/simd/SIMD)—are defined as a [struct](/mojo/manual/structs). This means the creation and destruction of any piece of data follows the same lifecycle rules, and you can define your own data types that work exactly the same way. Mojo structs don't get any default lifecycle methods, such as a constructor, copy constructor, or move constructor. That means you can create a struct without a constructor, but then you can't instantiate it, and it would be useful only as a sort of namespace for static methods. For example: ```mojo struct NoInstances: var state: Int @staticmethod fn print_hello(): print("Hello world!") ``` Without a constructor, this cannot be instantiated, so it has no lifecycle. The `state` field is also useless because it cannot be initialized (Mojo structs do not support default field values—you must initialize them in a constructor). So the only thing you can do is call the static method: ```mojo NoInstances.print_hello() ``` ```output Hello world! ``` ## Constructor To create an instance of a Mojo type, it needs the `__init__()` constructor method. The main responsibility of the constructor is to initialize all fields. For example: ```mojo struct MyPet: var name: String var age: Int fn __init__(out self, name: String, age: Int): self.name = name self.age = age ``` Now we can create an instance: ```mojo var mine = MyPet("Loki", 4) ``` An instance of `MyPet` can also be [read](/mojo/manual/values/ownership#read-arguments-read) and destroyed, but it currently can't be copied or moved. We believe this is a good default starting point, because there are no built-in lifecycle events and no surprise behaviors. You—the type author—must explicitly decide whether and how the type can be copied or moved, by implementing the copy and move constructors. :::note Mojo does not require a destructor to destroy an object. 
As long as all fields in the struct are destructible (every type in the standard library is destructible, except for [pointers](/mojo/stdlib/memory/unsafe)), then Mojo knows how to destroy the type when its lifetime ends. We'll discuss that more in [Death of a value](/mojo/manual/lifecycle/death).

:::

### Overloading the constructor

Like any other function/method, you can [overload](/mojo/manual/functions#overloaded-functions) the `__init__()` constructor to initialize the object with different arguments. For example, you might want a default constructor that sets some default values and takes no arguments, and then additional constructors that accept more arguments.

Just be aware that, in order to modify any fields, each constructor must declare the `self` argument with the [`out` convention](/mojo/manual/values/ownership#mutable-arguments-mut). If you want to call one constructor from another, you simply call upon that constructor as you would externally (you don't need to pass `self`).

For example, here's how you can delegate work from an overloaded constructor:

```mojo
struct MyPet:
    var name: String
    var age: Int

    fn __init__(out self):
        self.name = ""
        self.age = 0

    fn __init__(out self, name: String):
        self = MyPet()
        self.name = name
```

### Field initialization

Notice in the previous example that, by the end of each constructor, all fields must be initialized. That's the only requirement in the constructor.

In fact, the `__init__()` constructor is smart enough to treat the `self` object as fully initialized even before the constructor is finished, as long as all fields are initialized. For example, this constructor can pass around `self` as soon as all fields are initialized:

```mojo
fn use(arg: MyPet):
    pass

struct MyPet:
    var name: String
    var age: Int

    fn __init__(out self, name: String, age: Int, cond: Bool):
        self.name = name
        if cond:
            self.age = age
            use(self)  # Safe to use immediately!

        self.age = age
        use(self)  # Safe to use immediately!
```

### Constructors and implicit conversion

Mojo supports implicit conversion from one type to another. Implicit conversion can happen when one of the following occurs:

- You assign a value of one type to a variable with a different type.
- You pass a value of one type to a function that requires a different type.
- You return a value of one type from a function that specifies a different return type.

In all cases, implicit conversion is supported when the target type defines a constructor that meets the following criteria:

- Is declared with the `@implicit` decorator.
- Has a single required, non-keyword argument of the source type.

For example:

```mojo
var a = Source()
var b: Target = a
```

Mojo implicitly converts the `Source` value in `a` to a `Target` value if `Target` defines a matching constructor like this:

```mojo
struct Target:

    @implicit
    fn __init__(out self, s: Source):
        ...
```

With implicit conversion, the assignment above is essentially identical to:

```mojo
var b = Target(a)
```

In general, types should only support implicit conversions when the conversion is lossless, and ideally inexpensive. For example, converting an integer to a floating-point number is usually lossless (except for very large positive and negative integers, where the conversion may be approximate), but converting a floating-point number to an integer is very likely to lose information. So Mojo supports implicit conversion from `Int` to `Float64`, but not the reverse.
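Here's a quick sketch of that asymmetry (the `take_float()` helper is hypothetical):

```mojo
fn take_float(x: Float64):
    print(x)

fn main():
    var i = 42
    take_float(i)        # OK: Int implicitly converts to Float64
    # var j: Int = 3.14  # Error: no implicit conversion to Int
```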
The constructor used for implicit conversion can take optional arguments, so the following constructor would also support implicit conversion from `Source` to `Target`:

```mojo
struct Target:

    @implicit
    fn __init__(out self, s: Source, reverse: Bool = False):
        ...
```

Implicit conversion can fail if Mojo can't unambiguously match the conversion to a constructor. For example, if the target type has two overloaded constructors that take different types, and each of those types supports an implicit conversion from the source type, the compiler has two equally-valid paths to convert the values:

```mojo
struct A:
    @implicit
    fn __init__(out self, s: Source):
        ...

struct B:
    @implicit
    fn __init__(out self, s: Source):
        ...

struct OverloadedTarget:
    @implicit
    fn __init__(out self, a: A):
        ...

    @implicit
    fn __init__(out self, b: B):
        ...

var t = OverloadedTarget(Source())  # Error: ambiguous call to '__init__': each
                                    # candidate requires 1 implicit conversion
```

In this case, you can fix the issue by explicitly casting to one of the intermediate types. For example:

```mojo
var t = OverloadedTarget(A(Source()))  # OK
```

Mojo applies at most one implicit conversion to a variable. For example:

```mojo
var t: OverloadedTarget = Source()  # Error: can't implicitly convert Source
                                    # to OverloadedTarget
```

This fails because there's no direct conversion from `Source` to `OverloadedTarget`: the assignment would require two implicit conversions (`Source` to `A` or `B`, then to `OverloadedTarget`).

## Copy constructor

When Mojo encounters an assignment statement that doesn't use the [transfer sigil (`^`)](/mojo/manual/values/ownership#transfer-arguments-owned-and-), it tries to make a copy of the right-side value by calling upon that type's copy constructor: the `__copyinit__()` method. Thus, it's the responsibility of the type author to implement `__copyinit__()` so it returns a copy of the value.

For example, the `MyPet` type above does not have a copy constructor, so this code fails to compile:

```mojo
var mine = MyPet("Loki", 4)
var yours = mine  # This requires a copy, but MyPet has no copy constructor
```

To make it work, we need to add the copy constructor, like this:

```mojo
struct MyPet:
    var name: String
    var age: Int

    fn __init__(out self, name: String, age: Int):
        self.name = name
        self.age = age

    fn __copyinit__(out self, existing: Self):
        self.name = existing.name
        self.age = existing.age
```

:::note

`Self` (capital "S") is an alias for the current type name (`MyPet`, in this example). Using this alias is a best practice to avoid any mistakes when referring to the current struct name.

Also, notice that the `existing` argument in `__copyinit__()` is immutable because the default [argument convention](/mojo/manual/values/ownership#argument-conventions) is `read`—this is a good thing because this function should not modify the contents of the value being copied.

:::

Now this code works to make a copy:

```mojo
var mine = MyPet("Loki", 4)
var yours = mine
```

What makes Mojo's copy behavior different, compared to other languages, is that `__copyinit__()` is designed to perform a deep copy of all fields in the type (as per [value semantics](/mojo/manual/values/value-semantics)). That is, it copies heap-allocated values, rather than just copying the pointer.

However, the Mojo compiler doesn't enforce this, so it's the type author's responsibility to implement `__copyinit__()` with value semantics.
For example, here's a new `HeapArray` type that performs a deep copy in the copy constructor:

```mojo
from memory import UnsafePointer

struct HeapArray:
    var data: UnsafePointer[Int]
    var size: Int
    var cap: Int

    fn __init__(out self, size: Int, val: Int):
        self.size = size
        self.cap = size * 2
        self.data = UnsafePointer[Int].alloc(self.cap)
        for i in range(self.size):
            (self.data + i).init_pointee_copy(val)

    fn __copyinit__(out self, existing: Self):
        # Deep-copy the existing value
        self.size = existing.size
        self.cap = existing.cap
        self.data = UnsafePointer[Int].alloc(self.cap)
        for i in range(self.size):
            (self.data + i).init_pointee_copy(existing.data[i])
        # The lifetime of `existing` continues unchanged

    fn __del__(owned self):
        # We must free the heap-allocated data, but
        # Mojo knows how to destroy the other fields
        for i in range(self.size):
            (self.data + i).destroy_pointee()
        self.data.free()

    fn append(mut self, val: Int):
        # Update the array for demo purposes
        if self.size < self.cap:
            (self.data + self.size).init_pointee_copy(val)
            self.size += 1
        else:
            print("Out of bounds")

    fn dump(self):
        # Print the array contents for demo purposes
        print("[", end="")
        for i in range(self.size):
            if i > 0:
                print(", ", end="")
            print(self.data[i], end="")
        print("]")
```

Notice that `__copyinit__()` does not copy the `UnsafePointer` value (doing so would make the copied value refer to the same `data` memory address as the original value, which is a shallow copy). Instead, we initialize a new `UnsafePointer` to allocate a new block of memory, and then copy over all the heap-allocated values (this is a deep copy).

Thus, when we copy an instance of `HeapArray`, each copy has its own value on the heap, so changes to one value do not affect the other, as shown here:

```mojo
fn copies():
    var a = HeapArray(2, 1)
    var b = a    # Calls the copy constructor

    a.dump()     # Prints [1, 1]
    b.dump()     # Prints [1, 1]

    b.append(2)  # Changes the copied data
    b.dump()     # Prints [1, 1, 2]
    a.dump()     # Prints [1, 1] (the original did not change)
```

:::note

In `HeapArray`, we must use the `__del__()` destructor to free the heap-allocated data when the `HeapArray` lifetime ends, but Mojo automatically destroys all other fields when their respective lifetimes end. We'll discuss this destructor more in [Death of a value](/mojo/manual/lifecycle/death).

:::

If your type doesn't use any pointers for heap-allocated data, then writing the constructor and copy constructor is all boilerplate code that you shouldn't have to write. For most structs that don't manage memory explicitly, you can just add the [`@value` decorator](/mojo/manual/decorators/value) to your struct definition and Mojo will synthesize the `__init__()`, `__copyinit__()`, and `__moveinit__()` methods.

:::note

Mojo also calls upon the copy constructor when a value is passed to a function that takes the argument as [`owned`](/mojo/manual/values/ownership#transfer-arguments-owned-and-) *and* when the lifetime of the given value does *not* end at that point. If the lifetime of the value does end there (usually indicated with the transfer sigil `^`), then Mojo instead invokes the move constructor.

:::

## Move constructor

Although copying values provides predictable behavior that matches Mojo's [value semantics](/mojo/manual/values/value-semantics), copying some data types can be a significant hit on performance. If you're familiar with reference semantics, then the solution here might seem clear: instead of making a copy when passing a value, share the value as a reference. And if the original variable is no longer needed, nullify the original to avoid any double-free or use-after-free errors.
That's generally known as a move operation: the memory block holding the data remains the same (the memory does not actually move), but the pointer to that memory moves to a new variable. To support moving a value, implement the `__moveinit__()` method. The `__moveinit__()` method performs a consuming move: it [transfers ownership](/mojo/manual/values/ownership#transfer-arguments-owned-and-) of a value from one variable to another when the original variable's lifetime ends (also called a "destructive move").

:::note

A move constructor is **not required** to transfer ownership of a value. Unlike in Rust, transferring ownership is not always a move operation; the move constructors are only part of the implementation for how Mojo transfers ownership of a value. You can learn more in the section about [ownership transfer](/mojo/manual/values/ownership#transfer-arguments-owned-and-).

:::

When a move occurs, Mojo immediately invalidates the original variable, preventing any access to it and disabling its destructor. Invalidating the original variable is important to avoid memory errors on heap-allocated data, such as use-after-free and double-free errors.

Here's how to add the move constructor to the `HeapArray` example:

```mojo
from memory import UnsafePointer

struct HeapArray:
    var data: UnsafePointer[Int]
    var size: Int
    var cap: Int

    fn __init__(out self, size: Int, val: Int):
        self.size = size
        self.cap = size * 2
        self.data = UnsafePointer[Int].alloc(self.cap)
        for i in range(self.size):
            (self.data + i).init_pointee_copy(val)

    fn __copyinit__(out self, existing: Self):
        # Deep-copy the existing value
        self.size = existing.size
        self.cap = existing.cap
        self.data = UnsafePointer[Int].alloc(self.cap)
        for i in range(self.size):
            (self.data + i).init_pointee_copy(existing.data[i])
        # The lifetime of `existing` continues unchanged

    fn __moveinit__(out self, owned existing: Self):
        print("move")
        # Shallow copy the existing value
        self.size = existing.size
        self.cap = existing.cap
        self.data = existing.data
        # Then the lifetime of `existing` ends here, but
        # Mojo does NOT call its destructor

    fn __del__(owned self):
        # We must free the heap-allocated data, but
        # Mojo knows how to destroy the other fields
        for i in range(self.size):
            (self.data + i).destroy_pointee()
        self.data.free()

    fn append(mut self, val: Int):
        # Update the array for demo purposes
        if self.size < self.cap:
            (self.data + self.size).init_pointee_copy(val)
            self.size += 1
        else:
            print("Out of bounds")

    fn dump(self):
        # Print the array contents for demo purposes
        print("[", end="")
        for i in range(self.size):
            if i > 0:
                print(", ", end="")
            print(self.data[i], end="")
        print("]")
```

The critical feature of `__moveinit__()` is that it takes the incoming value as `owned`, meaning this method gets unique ownership of the value. Moreover, because this is a dunder method that Mojo calls only when performing a move (during ownership transfer), the `existing` argument is guaranteed to be a mutable reference to the original value, *not a copy* (unlike other methods that may declare an argument as `owned`, but might receive the value as a copy if the method is called without the [`^` transfer sigil](/mojo/manual/values/ownership#transfer-arguments-owned-and-)). That is, Mojo calls this move constructor *only* when the original variable's lifetime actually ends at the point of transfer.
Here's an example showing how to invoke the move constructor for `HeapArray`: ```mojo fn moves(): var a = HeapArray(3, 1) a.dump() # Prints [1, 1, 1] var b = a^ # Prints "move"; the lifetime of `a` ends here b.dump() # Prints [1, 1, 1] #a.dump() # ERROR: use of uninitialized value 'a' ``` Notice that `__moveinit__()` performs a shallow copy of the existing field values (it copies the pointer, instead of allocating new memory on the heap), which is what makes it useful for types with heap-allocated values that are expensive to copy. To go further and ensure your type can never be copied, you can make it "move-only" by implementing `__moveinit__()` and *excluding* `__copyinit__()`. A move-only type can be passed to other variables and passed into functions with any argument convention (`read`, `mut`, and `owned`)—the only catch is that you must use the `^` transfer sigil to end the lifetime of a move-only type when assigning it to a new variable or when passing it as an `owned` argument. :::note For types without heap-allocated fields, you get no real benefit from the move constructor. Making copies of simple data types on the stack, like integers, floats, and booleans, is very cheap. Yet, if you allow your type to be copied, then there's generally no reason to disallow moves, so you can synthesize both constructors by adding the [`@value` decorator](/mojo/manual/decorators/value). ::: ## Simple value types {#value-decorator} Because copy and move constructors are opt-in, Mojo provides great control for exotic use cases (such as for atomic values that should never be copied or moved), but most structs are simple aggregations of other types that should be easily copied and moved, and we don't want to write a lot of boilerplate constructors for those simple value types. To solve this, Mojo provides the [`@value` decorator](/mojo/manual/decorators/value), which synthesizes the boilerplate code for the `__init__()`, `__copyinit__()`, and `__moveinit__()` methods. For example, consider a simple struct like this: ```mojo @value struct MyPet: var name: String var age: Int ``` Mojo sees the `@value` decorator and notices that you don't have a member-wise initializer (a constructor with arguments for each field), a copy constructor, or a move constructor, so it synthesizes them for you. The result is as if you had actually written this: ```mojo struct MyPet: var name: String var age: Int fn __init__(out self, owned name: String, age: Int): self.name = name^ self.age = age fn __copyinit__(out self, existing: Self): self.name = existing.name self.age = existing.age fn __moveinit__(out self, owned existing: Self): self.name = existing.name^ self.age = existing.age ``` Mojo synthesizes each lifecycle method only when it doesn't exist, so you can use `@value` and still define your own versions to override the default behavior. For example, it is fairly common to use the default member-wise and move constructor, but create a custom copy constructor. Another common pattern is to use `@value` to create a member-wise constructor, and add overloads that take different sets of arguments. For example, if you want to create a `MyPet` struct without specifying an age, you could add an overloaded constructor: ```mojo @value struct MyPet: var name: String var age: Int fn __init__(out self, owned name: String): self.name = name^ self.age = 0 ``` Note that this overloaded constructor **doesn't** prevent the `@value` decorator from synthesizing the member-wise constructor. 
To override this default constructor, you'd need to add a constructor with the same signature as the default member-wise constructor. Something you can see in this code that we didn't mention yet is that the `__init__()` method takes all arguments as `owned`, because the constructor must take ownership to store each value. This is a useful micro-optimization and enables the use of move-only types. Trivial types like `Int` are also passed as `owned`, but because ownership doesn't mean anything for integers, we can elide that declaration and the transfer sigil (`^`) for simplicity. The transfer operator is also just a formality in this case, because, even if it's not used with `self.name = name^`, the Mojo compiler will notice that `name` is last used here and convert this assignment into a move, instead of a copy+delete. :::note If your type contains any move-only fields, Mojo will not generate the copy constructor because it cannot copy those fields. Further, the `@value` decorator won't work at all if any of your members are neither copyable nor movable. For example, if you have something like `Atomic` in your struct, then it probably isn't a true value type, and you don't want the copy/move constructors anyway. Also notice that the `MyPet` struct above doesn't include the `__del__()` destructor (the `@value` decorator does not synthesize this), because Mojo doesn't need it to destroy fields, as discussed in [Death of a value](/mojo/manual/lifecycle/death) ::: ## Trivial types So far, we've talked about values that live in memory, which means they have an identity (an address) that can be passed around among functions (passed "by reference"). This is great for most types, and it's a safe default for large objects with expensive copy operations. However, it's inefficient for tiny things like a single integer or floating point number. We call these types "trivial" because they are just "bags of bits" that should be copied, moved, and destroyed without invoking any custom lifecycle methods. Trivial types are the most common types that surround us, and from a language perspective, Mojo doesn't need special support for these written in a struct. Usually, these values are so tiny that they should be passed around in CPU registers, not indirectly through memory. As such, Mojo provides a struct decorator to declare these types of values: `@register_passable("trivial")`. This decorator tells Mojo that the type should be copyable and movable but that it has no user-defined logic (no lifecycle methods) for doing this. It also tells Mojo to pass the value in CPU registers whenever possible, which has clear performance benefits. You'll see this decorator on types like `Int` in the standard library: ```mojo @register_passable("trivial") struct Int: var value: __mlir_type.index fn __init__(value: __mlir_type.index) -> Int: return Self {value: value} ... ``` We expect to use this decorator pervasively on Mojo standard library types, but it is safe to ignore for general application-level code. For more information, see the [`@register_passable` documentation](/mojo/manual/decorators/register-passable). :::note TODO This decorator is due for reconsideration. Lack of custom copy/move/destroy logic and "passability in a register" are orthogonal concerns and should be split. This former logic should be subsumed into a more general `@value("trivial")` decorator, which is orthogonal from `@register_passable`. 
::: --- ## Lifetimes, origins, and references The Mojo compiler includes a lifetime checker, a compiler pass that analyzes dataflow through your program. It identifies when variables are valid and inserts destructor calls when a variable's lifetime ends. The Mojo compiler uses a special value called an *origin* to track the lifetime of variables and the validity of references. Specifically, an origin answers two questions: * What variable "owns" this value? * Can the value be mutated using this reference? For example, consider the following code: ```mojo fn print_str(s: String): print(s) name = String("Joan") print_str(name) ``` ```output Joan ``` The line `name = String("Joan")` declares a variable with an identifier (`name`) and logical storage space for a `String` value. When you pass `name` into the `print_str()` function, the function gets an immutable reference to the value. So both `name` and `s` refer to the same logical storage space, and have associated origin values that lets the Mojo compiler reason about them. Most of the time, origins are handled automatically by the compiler. However, in some cases you'll need to interact with origins directly: * When working with references—specifically `ref` arguments and `ref` return values. * When working with types like [`Pointer`](/mojo/stdlib/memory/pointer/Pointer) or [`Span`](/mojo/stdlib/memory/span/Span) which are parameterized on the origin of the data they refer to. This section also covers [`ref` arguments](#ref-arguments) and [`ref` return values](#ref-return-values), which let functions take arguments and provide return values as references with parametric origins. ## Working with origins Mojo's origin values are unlike most other values in the language, because they're primitive values, not Mojo structs. Likewise, because these values are mostly created by the compiler, you can't just create your own origin value—you usually need to derive an origin from an existing value. ### Origin types Mojo supplies a struct and a set of aliases that you can use to specify origin types. As the names suggest, the `ImmutableOrigin` and `MutableOrigin` aliases represent immutable and mutable origins, respectively: ```mojo struct ImmutableRef[origin: ImmutableOrigin]: pass ``` Or you can use the [`Origin`](/mojo/stdlib/builtin/type_aliases/Origin) struct to specify an origin with parametric mutability: ```mojo struct ParametricRef[ is_mutable: Bool, //, origin: Origin[is_mutable] ]: pass ``` Origin types carry the mutability of a reference as a boolean parameter value, indicating whether the origin is mutable, immutable, or even with mutability depending on a parameter specified by the enclosing API. The `is_mutable` parameter here is an [infer-only parameter](/mojo/manual/parameters/#infer-only-parameters). The `origin` value is often inferred, as well. For example, the following code creates a [`Pointer`](/mojo/stdlib/memory/pointer/Pointer) to an existing value, but doesn't need to specify an origin—the `origin` is inferred from the existing value. ```mojo from memory import Pointer def use_pointer(): a = 10 ptr = Pointer(to=a) ``` A final type of origin value is an `OriginSet`. As the name suggests, an `OriginSet` represents a group of origins. ### Origin values Most origin values are created by the compiler. As a developer, there are a few ways to specify origin values: * Static origin. The `StaticConstantOrigin` alias is an origin value representing immutable values that last for the duration of the program. 
String literal values have a `StaticConstantOrigin`.
* Derived origin. The `__origin_of()` magic function returns the origin associated with the value (or values) passed in.
* Inferred origin. You can use inferred parameters to capture the origin of a value passed in to a function.
* Wildcard origins. The `ImmutableAnyOrigin` and `MutableAnyOrigin` aliases are special cases indicating a reference that might access any live value.

#### Static origins

You can use the static origin `StaticConstantOrigin` when you have a value that exists for the entire duration of the program. For example, the `StringLiteral` method [`as_string_slice()`](/mojo/stdlib/builtin/string_literal/StringLiteral#as_string_slice) returns a [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice) pointing to the original string literal. String literals are static—they're allocated at compile time and never destroyed—so the slice is created with an immutable, static origin.

#### Derived origins

Use the `__origin_of(value)` operator to obtain a value's origin. An argument to `__origin_of()` can take an arbitrary expression that yields one of the following:

- An origin value.
- A value with a memory location.

For example:

```mojo
__origin_of(self)
__origin_of(x.y)
__origin_of(foo())
```

The `__origin_of()` operator is analyzed statically at compile time; the expressions passed to `__origin_of()` are never evaluated. (For example, when the compiler analyzes `__origin_of(foo())`, it doesn't run the `foo()` function.)

The following struct stores a string value using an [`OwnedPointer`](/mojo/stdlib/memory/owned_pointer/OwnedPointer): a smart pointer that holds an owned value. The `as_ptr()` method returns a `Pointer` to the stored string, using the same origin as the original `OwnedPointer`.

```mojo
from memory import OwnedPointer, Pointer

struct BoxedString:
    var o_ptr: OwnedPointer[String]

    fn __init__(out self, value: String):
        self.o_ptr = OwnedPointer(value)

    fn as_ptr(mut self) -> Pointer[String, __origin_of(self.o_ptr)]:
        return Pointer(to=self.o_ptr[])
```

Note that the `as_ptr()` method takes its `self` argument as `mut self`. If it used the default `read` argument convention, it would be immutable, and the derived origin (`__origin_of(self.o_ptr)`) would also be immutable.

You can also pass multiple expressions to `__origin_of()` to express the union of two or more origins:

`__origin_of(a, b)`

#### Inferred origins

The other common way to access an origin value is to *infer* it from the arguments passed to a function or method. For example, the `Span` type has an associated `origin`:

```mojo
struct Span[
    is_mutable: Bool, //,
    T: Copyable & Movable,
    origin: Origin[is_mutable],
](CollectionElementNew):
    """A non owning view of contiguous data.
```

One of its constructors creates a `Span` from an existing `List`, and infers its `origin` value from the list:

```mojo
fn __init__(out self, ref [origin]list: List[T, *_]):
    """Construct a Span from a List.

    Args:
        list: The list to which the span refers.
    """
    self._data = list.data
    self._len = len(list)
```

## Working with references

You can use the `ref` keyword with arguments and return values to specify a reference with parametric mutability. That is, they can be either mutable or immutable.

From inside the called function, a `ref` argument looks like a `read` or `mut` argument. A `ref` return value looks like any other return value to the calling function, but it is a *reference* to an existing value, not a copy.
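Before getting into the details below, here's a minimal sketch of both features together (the `first_element()` function is hypothetical): a `ref` argument with an inferred origin, and a `ref` return value tied to that argument's origin.

```mojo
fn first_element(ref items: List[String]) -> ref [items] String:
    # The returned reference shares the origin (and mutability) of `items`.
    return items[0]

def main():
    var names = List[String]("Thor", "Athena")
    first_element(names) += "!"  # Mutates through the returned reference
    print(names[0])              # => Thor!
```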
## Working with references

You can use the `ref` keyword with arguments and return values to specify a reference with parametric mutability. That is, they can be either mutable or immutable.

From inside the called function, a `ref` argument looks like a `read` or `mut` argument. A `ref` return value looks like any other return value to the calling function, but it is a *reference* to an existing value, not a copy.

### `ref` arguments

The `ref` argument convention lets you specify an argument of parametric mutability: that is, you don't need to know in advance whether the passed argument will be mutable or immutable. There are several reasons you might want to use a `ref` argument:

* You want to accept an argument with parametric mutability.
* You want to tie the lifetime of one argument to the lifetime of another argument.
* When you want an argument that is guaranteed to be passed in memory: this can be important and useful for generic arguments that need an identity, irrespective of whether the concrete type is register passable.

The syntax for a `ref` argument is:

`ref arg_name: arg_type`

Or:

`ref [origin_specifier(s)] arg_name: arg_type`

In the first form, the origin and mutability of the `ref` argument are inferred from the value passed in. The second form includes an origin clause, consisting of one or more origin specifiers inside square brackets. An origin specifier can be either:

* An origin value.
* An arbitrary expression, which is treated as shorthand for `__origin_of(expression)`. In other words, the following declarations are equivalent:

  ```mojo
  ref [__origin_of(self)]
  ref [self]
  ```

* An [`AddressSpace`](/nightly/mojo/stdlib/memory/pointer/AddressSpace) value.
* An underscore character (`_`) to indicate that the origin is *unbound*. This is equivalent to omitting the origin specifier.

```mojo
def add_ref(ref a: Int, b: Int) -> Int:
    return a + b
```

You can also name the origin explicitly. This is useful if you want to specify an `ImmutableOrigin` or `MutableOrigin`, or if you want to bind to the `is_mutable` parameter.

```mojo
def take_str_ref[
    is_mutable: Bool, //,
    origin: Origin[is_mutable]
](ref [origin] s: String):
    @parameter
    if is_mutable:
        print("Mutable: " + s)
    else:
        print("Immutable: " + s)

def pass_refs(s1: String, owned s2: String):
    take_str_ref(s1)
    take_str_ref(s2)

pass_refs("Hello", "Goodbye")
```

```output
Immutable: Hello
Mutable: Goodbye
```

### `ref` return values

Like `ref` arguments, `ref` return values allow a function to return a mutable or immutable reference to a value. The syntax for a `ref` return value is:

`-> ref [origin_specifier(s)] arg_type`

Note that you **must** specify an origin specifier for a `ref` return value. The values allowed for origin specifiers are the same as the ones listed for [`ref` arguments](#ref-arguments).

`ref` return values can be an efficient way to handle updating items in a collection. The standard way to do this is by implementing the `__getitem__()` and `__setitem__()` dunder methods. These are invoked to read from and write to a subscripted item in a collection:

```mojo
value = list[a]
list[b] += 10
```

With a `ref` return value, `__getitem__()` can return a mutable reference that can be modified directly. This has pros and cons compared to using a `__setitem__()` method:

* The mutable reference is more efficient—a single update isn't broken up across two methods. However, the referenced value must be in memory.
* A `__getitem__()`/`__setitem__()` pair allows for arbitrary code to be run when values are retrieved and set. For example, `__setitem__()` can validate or constrain input values.
In the following example, `NameList` has a `__getitem__()` method that returns a reference:

```mojo
struct NameList:
    var names: List[String]

    def __init__(out self, *names: String):
        self.names = List[String]()
        for name in names:
            self.names.append(name[])

    def __getitem__(ref self, index: Int) -> ref [self.names] String:
        if (index >= 0 and index < len(self.names)):
            return self.names[index]
        else:
            raise Error("index out of bounds")

def use_name_list():
    list = NameList("Thor", "Athena", "Dana", "Vrinda")
    print(list[2])
    list[2] += "?"
    print(list[2])

use_name_list()
```

```output
Dana
Dana?
```

Here the return origin is derived from `self.names`, the list that owns the strings. You can also write the origin in terms of `self`:

```mojo
def __getitem__(ref self, index: Int) -> ref [self] String:
```

Since the `origin` of the return value is tied to the origin of `self`, the returned reference will be mutable if the method was called using a mutable reference. The method still works if you have an immutable reference to the `NameList`, but it returns an immutable reference:

```mojo
fn pass_immutable_list(list: NameList) raises:
    print(list[2])
    # list[2] += "?" # Error, this list is immutable

def use_name_list_again():
    list = NameList("Sophie", "Jack", "Diana")
    pass_immutable_list(list)

use_name_list_again()
```

```output
Diana
```

Without parametric mutability, you'd need to write two versions of `__getitem__()`, one that accepts an immutable `self` and another that accepts a mutable `self`.

#### Return values with union origins

A `ref` return value can include multiple values in its origin specifier, which yields the union of the origins. For example, the following `pick_one()` function returns a reference to one of the two input strings, with an origin that's a union of both origins.

```mojo
def pick_one(cond: Bool, ref a: String, ref b: String) -> ref [a, b] String:
    return a if cond else b
```

---

## likely

`likely(val: Bool) -> Bool`

Provides information that the most probable value of `val` is going to be `True`. This information can be used by optimizers.

**Args:**

* ​val (`Bool`): The input value which is likely to be `True` most of the time.

**Returns:**

The input value.
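As a quick usage sketch (assuming `likely` is exported from `sys.intrinsics`; the function and condition here are illustrative):

```mojo
from sys.intrinsics import likely

fn clamp_non_negative(x: Int) -> Int:
    # Hint to the optimizer that the common case is x >= 0, so it can
    # lay out the fall-through path for that branch.
    if likely(x >= 0):
        return x
    return 0
```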
---

## linalg

Provides CPU and GPU implementations of linear algebra functions.

## Modules

* [​`accumulate`](./accumulate/):
* [​`apple_accelerate`](./apple_accelerate/):
* [​`apple_amx_intrinsics`](./apple_amx_intrinsics/):
* [​`bmm`](./bmm/):
* [​`dispatch_table_a100_gpu`](./dispatch_table_a100_gpu/):
* [​`dispatch_table_amd`](./dispatch_table_amd/):
* [​`distributed_matmul`](./distributed_matmul/):
* [​`dual_gemm`](./dual_gemm/):
* [​`fast_div`](./fast_div/): Implements the fast division algorithm.
* [​`fp8_quantization`](./fp8_quantization/):
* [​`gemv`](./gemv/):
* [​`grouped_matmul`](./grouped_matmul/):
* [​`intel_amx_intrinsics`](./intel_amx_intrinsics/):
* [​`matmul`](./matmul/):
* [​`matmul_default`](./matmul_default/):
* [​`matmul_gpu`](./matmul_gpu/):
* [​`matmul_i8mm`](./matmul_i8mm/):
* [​`matmul_neon`](./matmul_neon/):
* [​`matmul_sm90`](./matmul_sm90/):
* [​`matmul_tile_scheduler`](./matmul_tile_scheduler/):
* [​`matmul_vendor`](./matmul_vendor/):
* [​`matmul_vnni`](./matmul_vnni/):
* [​`matrix_band_part`](./matrix_band_part/): The module implements matrix band part functions.
* [​`neon_intrinsics`](./neon_intrinsics/):
* [​`packing`](./packing/):
* [​`qr_factorization`](./qr_factorization/):
* [​`transpose`](./transpose/): The module implements Transpose functions.
* [​`utils`](./utils/):
* [​`utils_gpu`](./utils_gpu/):
* [​`vendor_blas`](./vendor_blas/):
* [​`vnni_intrinsics`](./vnni_intrinsics/):

---

## linear

Multi-layer Perceptron.

## `ColumnParallelLinear` {#max.nn.linear.ColumnParallelLinear}

> *class* max.nn.linear.ColumnParallelLinear(in\_dim, out\_dim, dtype, devices, tied\_weight=None, \*\*kwargs)

A Linear layer where the weight and bias are sharded onto multiple devices.

This layer first computes $y_i = xW_i^T + b_i$ for each device i in \[0, …, num\_devices - 1]:

```default
+-----+       +-----+ T    +-----+      +-----+
|     |       | W_0 |      | b_0 |      | y_0 |  GPU0
|     |       +-----+      +-----+      +-----+
|     |       | W_1 |      | b_1 |      | y_1 |  GPU1
|  x  |   @   +-----+  +   +-----+  =   +-----+
|     |       | W_2 |      | b_2 |      | y_2 |  GPU2
|     |       +-----+      +-----+      +-----+
|     |       | W_3 |      | b_3 |      | y_3 |  GPU3
+-----+       +-----+      +-----+      +-----+
```

The values are then collected using an Allgather op, producing the same output tensor $y = xW^T + b$ on each device:

```default
GPU0  GPU1  GPU2  GPU3                      GPU0  GPU1  GPU2  GPU3

+-----+-----+-----+-----+                   +-----+-----+-----+-----+
| y_0 |  -  |  -  |  -  |                   | y_0 | y_0 | y_0 | y_0 |
+-----+-----+-----+-----+                   +-----+-----+-----+-----+
|  -  | y_1 |  -  |  -  |                   | y_1 | y_1 | y_1 | y_1 |
+-----+-----+-----+-----+  -- Allgather --> +-----+-----+-----+-----+
|  -  |  -  | y_2 |  -  |                   | y_2 | y_2 | y_2 | y_2 |
+-----+-----+-----+-----+                   +-----+-----+-----+-----+
|  -  |  -  |  -  | y_3 |                   | y_3 | y_3 | y_3 | y_3 |
+-----+-----+-----+-----+                   +-----+-----+-----+-----+
```

Example usage:

```python
from max.dtype import DType
from max.graph import DeviceRef
from max.nn import ColumnParallelLinear

num_devices = 4
in_dim = 256   # example dimensions
out_dim = 128

distributed_linear = ColumnParallelLinear(
    in_dim,
    out_dim,
    DType.float32,
    devices=[DeviceRef.GPU(i) for i in range(num_devices)],
)
```

**Parameters:**

* **in\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimensionality of the input space.
* **out\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimensionality of the output space.
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The data type for both weights and bias.
* **devices** (`Sequence` `[` `DeviceRef` `]` ) – The target devices for computation. Weights remain on CPU until sharded and moved to device during computation.
* **tied\_weight** ([`Weight`](../graph/Weight.md#max.graph.Weight) `|` `None` )

## `DistributedMLP` {#max.nn.linear.DistributedMLP}

> *class* max.nn.linear.DistributedMLP(\*args, \*\*kwargs)

A distributed multi-layer perceptron. This class has the same state keys as the non-distributed MLP Layer.

**Parameters:**

* **dtype** – DType to use for the layer weights, which should match the input dtype.
* **quantization\_encoding** – Quantization encoding of the layer weights.
* **hidden\_dim** – The last dimension of the layer input.
* **feed\_forward\_length** – Size of dimension used to project the inputs.
* **linear\_cls** – Linear class to use to create the projection layers.
* **devices** – Devices to run the MLP layer across (all provided devices are used).
* **activation\_function** – Activation function to use. Options are:
  * “silu”
  * “gelu”
  * “gelu\_tanh”
  * “relu”
  * “tanh”
  * “sigmoid”

## `Float8Config` {#max.nn.linear.Float8Config}

> *class* max.nn.linear.Float8Config(input\_scale, weight\_scale, mlp\_in\_float8, attn\_qkv\_in\_float8, embedding\_output\_dtype=None, quant\_method=None)

Configures float8 quantization settings for a layer or model section.
**Parameters:** * **input\_scale** ([`Float8InputScaleSpec`](#max.nn.linear.Float8InputScaleSpec) ) * **weight\_scale** ([`Float8WeightScaleSpec`](#max.nn.linear.Float8WeightScaleSpec) ) * **mlp\_in\_float8** ([`set`](https://docs.python.org/3/library/stdtypes.html#set) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **attn\_qkv\_in\_float8** ([`set`](https://docs.python.org/3/library/stdtypes.html#set) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **embedding\_output\_dtype** ([`DType`](../dtype.md#max.dtype.DType) `|` `None` ) * **quant\_method** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) ### `attn_qkv_in_float8` {#max.nn.linear.Float8Config.attn_qkv_in_float8} > attn\_qkv\_in\_float8\*: [set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)]\* Set of layer indices with attention QKV projections in float8. QKV projections are considered to be either “all quantized” or all not quantized per layer. So either all of {q,k,v,o}\_proj are float8, or all bfloat16. ### `embedding_output_dtype` {#max.nn.linear.Float8Config.embedding_output_dtype} > embedding\_output\_dtype\*: [DType](../dtype.md#max.dtype.DType) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The data type of the output from the embedding layer. ### `input_scale` {#max.nn.linear.Float8Config.input_scale} > input\_scale\*: [Float8InputScaleSpec](#max.nn.linear.Float8InputScaleSpec)\* Specification for input activation scaling. ### `is_dynamic` {#max.nn.linear.Float8Config.is_dynamic} > *property* is\_dynamic\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* Returns true if this input scale is dynamic. ### `is_static` {#max.nn.linear.Float8Config.is_static} > *property* is\_static\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* Returns true if this input scale is static. ### `mlp_in_float8` {#max.nn.linear.Float8Config.mlp_in_float8} > mlp\_in\_float8\*: [set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)]\* Set of layer indices with MLPs in float8. MLPs are considered to be either “all quantized” or all not quantized per layer. So either all of gate proj, down proj, and up proj are float8, or all bfloat16. ### `quant_method` {#max.nn.linear.Float8Config.quant_method} > quant\_method\*: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The quantization method used (e.g., “fbgemm\_fp8”). ### `weight_scale` {#max.nn.linear.Float8Config.weight_scale} > weight\_scale\*: [Float8WeightScaleSpec](#max.nn.linear.Float8WeightScaleSpec)\* Specification for weight scaling. ## `Float8InputScaleSpec` {#max.nn.linear.Float8InputScaleSpec} > *class* max.nn.linear.Float8InputScaleSpec(granularity, origin, dtype, activation\_scale\_ub=None) Specifies how input activations are scaled for float8 quantization. 
**Parameters:**

* **granularity** ([`Float8ScaleGranularity`](#max.nn.linear.Float8ScaleGranularity) )
* **origin** ([`Float8ScaleOrigin`](#max.nn.linear.Float8ScaleOrigin) )
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) )
* **activation\_scale\_ub** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` )

### `activation_scale_ub` {#max.nn.linear.Float8InputScaleSpec.activation_scale_ub}

> activation\_scale\_ub\*: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

An optional upper bound for dynamic activation scaling.

### `dtype` {#max.nn.linear.Float8InputScaleSpec.dtype}

> dtype\*: [DType](../dtype.md#max.dtype.DType)\*

The data type of the input scale factor(s).

### `granularity` {#max.nn.linear.Float8InputScaleSpec.granularity}

> granularity\*: [Float8ScaleGranularity](#max.nn.linear.Float8ScaleGranularity)\*

The granularity of the input scale factor application.

### `origin` {#max.nn.linear.Float8InputScaleSpec.origin}

> origin\*: [Float8ScaleOrigin](#max.nn.linear.Float8ScaleOrigin)\*

The origin (static or dynamic) of the input scale factor.

## `Float8ScaleGranularity` {#max.nn.linear.Float8ScaleGranularity}

> *class* max.nn.linear.Float8ScaleGranularity(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)

Specifies the granularity of the quantization scale factor. Determines whether a scale factor applies per-tensor, per-row (often for weights), per-column, or per-block within a tensor.

### `BLOCK` {#max.nn.linear.Float8ScaleGranularity.BLOCK}

> BLOCK *= 'block'*

### `COLWISE` {#max.nn.linear.Float8ScaleGranularity.COLWISE}

> COLWISE *= 'colwise'*

### `ROWWISE` {#max.nn.linear.Float8ScaleGranularity.ROWWISE}

> ROWWISE *= 'rowwise'*

### `TENSOR` {#max.nn.linear.Float8ScaleGranularity.TENSOR}

> TENSOR *= 'tensor'*

## `Float8ScaleOrigin` {#max.nn.linear.Float8ScaleOrigin}

> *class* max.nn.linear.Float8ScaleOrigin(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)

Specifies whether the quantization scale is determined statically or dynamically. STATIC scales are pre-computed and loaded with the model weights. DYNAMIC scales are computed at runtime based on the input data.

### `DYNAMIC` {#max.nn.linear.Float8ScaleOrigin.DYNAMIC}

> DYNAMIC *= 'dynamic'*

### `STATIC` {#max.nn.linear.Float8ScaleOrigin.STATIC}

> STATIC *= 'static'*

## `Float8WeightScaleSpec` {#max.nn.linear.Float8WeightScaleSpec}

> *class* max.nn.linear.Float8WeightScaleSpec(granularity, dtype)

Specifies how weights are scaled for float8 quantization.

**Parameters:**

* **granularity** ([`Float8ScaleGranularity`](#max.nn.linear.Float8ScaleGranularity) )
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) )

### `dtype` {#max.nn.linear.Float8WeightScaleSpec.dtype}

> dtype\*: [DType](../dtype.md#max.dtype.DType)\*

The data type of the weight scale factor(s).

### `granularity` {#max.nn.linear.Float8WeightScaleSpec.granularity}

> granularity\*: [Float8ScaleGranularity](#max.nn.linear.Float8ScaleGranularity)\*

The granularity of the weight scale factor application.

### `is_block` {#max.nn.linear.Float8WeightScaleSpec.is_block}

> *property* is\_block\*: [bool](https://docs.python.org/3/library/functions.html#bool)\*

Whether the weight scale granularity is block-wise.
### `is_colwise` {#max.nn.linear.Float8WeightScaleSpec.is_colwise} > *property* is\_colwise\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* Whether the weight scale granularity is column-wise. ### `is_rowwise` {#max.nn.linear.Float8WeightScaleSpec.is_rowwise} > *property* is\_rowwise\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* Whether the weight scale granularity is row-wise. ### `is_tensor` {#max.nn.linear.Float8WeightScaleSpec.is_tensor} > *property* is\_tensor\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* Whether the weight scale granularity is per-tensor. ## `GPTQLinear` {#max.nn.linear.GPTQLinear} > *class* max.nn.linear.GPTQLinear(in\_dim, out\_dim, dtype, device, has\_bias=False, quantization\_encoding=None, quantization\_config=None, float8\_config=None) A Linear layer for GPTQ encoding Initializes the linear layer with weights and optional bias with GPTQ quantization. **Parameters:** * **in\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimensionality of the input space. * **out\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimensionality of the output space. * **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The data type for both weights and bias. * **device** (`DeviceRef` ) – The target device for computation. Weights remain on CPU until moved during computation. * **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – When [`True`](https://docs.python.org/3/library/constants.html#True), adds a bias vector to the layer. Defaults to [`False`](https://docs.python.org/3/library/constants.html#False). * **quantization\_encoding** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) `|` `None` ) – The quantization encoding of the weights. * **quantization\_config** ([`QuantizationConfig`](../graph/quantization.md#max.graph.quantization.QuantizationConfig) `|` `None` ) – Extra config for the weight quantization. 
* **float8\_config** ([`Float8Config`](#max.nn.linear.Float8Config) `|` `None` ) ## `GPTQLinearV1` {#max.nn.linear.GPTQLinearV1} > *class* max.nn.linear.GPTQLinearV1(weight, bias=None, quantization\_encoding=None, quantization\_config=None, perm\_idx=None) A Linear layer for GPTQ encoding **Parameters:** * **weight** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) * **quantization\_encoding** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) `|` `None` ) * **quantization\_config** ([`QuantizationConfig`](../graph/quantization.md#max.graph.quantization.QuantizationConfig) `|` `None` ) * **perm\_idx** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) ### `perm_idx` {#max.nn.linear.GPTQLinearV1.perm_idx} > perm\_idx\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* ### `quantization_config` {#max.nn.linear.GPTQLinearV1.quantization_config} > quantization\_config\*: [QuantizationConfig](../graph/quantization.md#max.graph.quantization.QuantizationConfig) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* ## `Linear` {#max.nn.linear.Linear} > *class* max.nn.linear.Linear(in\_dim, 
out\_dim, dtype, device, has\_bias=False, quantization\_encoding=None, float8\_config=None, name=None, clip\_weight=None) Applies a linear transformation to incoming data: $y = xW^T + b$. This layer implements a fully connected layer where inputs are multiplied by a weight matrix and optionally added with a bias vector. Both weights and bias initially reside on CPU, and the model init phase moves them to [`device`](#max.nn.linear.Linear.device). Example: ```python linear_layer = Linear( in_dim=256, out_dim=128, dtype=DType.float32, device=DeviceRef.GPU(), name="linear", has_bias=True ) input_tensor: TensorValue output = linear_layer(input_tensor) ``` Initializes the linear layer with weights and optional bias. **Parameters:** * **in\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimensionality of the input space. * **out\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimensionality of the output space. * **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The data type for both weights and bias. * **device** (`DeviceRef` ) – The target device for computation. Weights remain on CPU until moved during computation. * **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) – Base name for weights (appended with `.weight` and `.bias` if applicable). * **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – When [`True`](https://docs.python.org/3/library/constants.html#True), adds a bias vector to the layer. Defaults to [`False`](https://docs.python.org/3/library/constants.html#False). * **quantization\_encoding** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) `|` `None` ) * **float8\_config** ([`Float8Config`](#max.nn.linear.Float8Config) `|` `None` ) * **clip\_weight** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) ### `bias` {#max.nn.linear.Linear.bias} > bias\*: [Weight](../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The optional bias vector stored on CPU with shape (out\_dim,). Model init moves the bias to [`device`](#max.nn.linear.Linear.device) if present. ### `device` {#max.nn.linear.Linear.device} > device\*: DeviceRef\* The device where matrix operations are performed. ### `input_scale` {#max.nn.linear.Linear.input_scale} > input\_scale\*: [Weight](../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The optional input scale stored on CPU with shape (). Model init moves the input\_scale to [`device`](#max.nn.linear.Linear.device) if present. ### `set_sharding()` {#max.nn.linear.Linear.set_sharding} > set\_sharding(strategy) Sets the weight sharding for this linear layer. **Parameters:** **strategy** (`ShardingStrategy` ) – The strategy describing the weight sharding. **Return type:** None ### `weight` {#max.nn.linear.Linear.weight} > weight\*: [Weight](../graph/Weight.md#max.graph.Weight)\* The weight matrix stored on CPU with shape (out\_dim, in\_dim). Model init transposes the weight and moves it to [`device`](#max.nn.linear.Linear.device). ### `weight_scale` {#max.nn.linear.Linear.weight_scale} > weight\_scale\*: [Weight](../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* The optional weight scale stored on CPU with shape () or (N,). 
Model init moves the weight\_scale to [`device`](#max.nn.linear.Linear.device) if present. ## `LinearV1` {#max.nn.linear.LinearV1} > *class* max.nn.linear.LinearV1(weight, bias=None) A unified linear layer that delegates to either regular or quantized implementation. Deprecated: Use Linear instead. **Parameters:** * **weight** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) ### `bias` {#max.nn.linear.LinearV1.bias} > bias\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* ### `create()` {#max.nn.linear.LinearV1.create} > *classmethod* create(dtype, quantization\_encoding, in\_features, out\_features, weights, bias=None, quantization\_config=None) Factory method to create a Linear layer with appropriate implementation. 
**Parameters:** * **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) * **quantization\_encoding** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) `|` `None` ) * **in\_features** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **out\_features** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **weights** (`Weights` `|` [`Weight`](../graph/Weight.md#max.graph.Weight) ) * **bias** (`Weights` `|` [`Weight`](../graph/Weight.md#max.graph.Weight) `|` `None` ) * **quantization\_config** ([`QuantizationConfig`](../graph/quantization.md#max.graph.quantization.QuantizationConfig) `|` `None` ) **Return type:** [*LinearV1*](#max.nn.linear.LinearV1) ### `weight` {#max.nn.linear.LinearV1.weight} > weight\*: Value\[TensorType] | [TensorValue](../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../graph/type.md#max.graph.type.Shape) | [Dim](../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ## `MLP` {#max.nn.linear.MLP} > *class* max.nn.linear.MLP(dtype, quantization\_encoding, hidden\_dim, feed\_forward\_length, devices, linear\_cls=\, has\_bias=False, activation\_function='silu', float8\_config=None) Simple multi-layer perceptron composed of three linear layers. Defaults to SiLU activation function. **Parameters:** * **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – DType to use for the layer weights, which should match the input dtype. * **quantization\_encoding** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) `|` `None` ) – Quantization encoding of the layer weights. * **hidden\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The last dimension of the layer input. * **feed\_forward\_length** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Size of dimension used to project the inputs. * **linear\_cls** (`Callable` `[` `...` `,` [`Linear`](#max.nn.linear.Linear) `]` ) – Linear class to use to create the projection layers. * **devices** (`Sequence` `[` `DeviceRef` `]` ) – Devices to run the MLP layer. If multiple are provided, the first device is used instead. Use DistributedMLP to use all devices. * **activation\_function** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – Activation function to use. Options are: * “silu” * “gelu” * “gelu\_tanh” * “relu” * “tanh” * “sigmoid” * **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **float8\_config** ([`Float8Config`](#max.nn.linear.Float8Config) `|` `None` ) ## `MLPV1` {#max.nn.linear.MLPV1} > *class* max.nn.linear.MLPV1(gate\_proj, down\_proj, up\_proj) Simple multi-layer perceptron composed of three linear layers. Uses SiLU activation function. 
**Parameters:**

* **gate\_proj** ([`LinearV1`](#max.nn.linear.LinearV1) )
* **down\_proj** ([`LinearV1`](#max.nn.linear.LinearV1) )
* **up\_proj** ([`LinearV1`](#max.nn.linear.LinearV1) )

### `down_proj` {#max.nn.linear.MLPV1.down_proj}

> down\_proj\*: [LinearV1](#max.nn.linear.LinearV1)\*

### `gate_proj` {#max.nn.linear.MLPV1.gate_proj}

> gate\_proj\*: [LinearV1](#max.nn.linear.LinearV1)\*

### `up_proj` {#max.nn.linear.MLPV1.up_proj}

> up\_proj\*: [LinearV1](#max.nn.linear.LinearV1)\*

## `QLinearV1` {#max.nn.linear.QLinearV1}

> *class* max.nn.linear.QLinearV1(weight, bias=None, quantization\_encoding=None)

A quantized fully connected layer.

**Parameters:**

* **weight** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` )
* **quantization\_encoding** ([`QuantizationEncoding`](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) `|` `None` )

### `quantization_encoding` {#max.nn.linear.QLinearV1.quantization_encoding}

> quantization\_encoding\*: [QuantizationEncoding](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

---

## linear_filter

`linear_filter(x: SIMD[float32, 1]) -> SIMD[float32, 1]`

This is a tent filter: f(x) = 1 + x for -1 <= x <= 0, f(x) = 1 - x for 0 < x <= 1, and f(x) = 0 for |x| > 1.

---

## linked_list

## Structs

* [​`LinkedList`](/mojo/stdlib/collections/linked_list/LinkedList): A doubly-linked list implementation.
* [​`Node`](/mojo/stdlib/collections/linked_list/Node): A node in a linked list data structure.

---

## LinkedList

`struct LinkedList[ElementType: Copyable & Movable]`

A doubly-linked list implementation. A doubly-linked list is a data structure where each element points to both the next and previous elements, allowing for efficient insertion and deletion at any position.

## Parameters

* ​ElementType (`Copyable & Movable`): The type of elements stored in the list. Must implement the `Copyable` and `Movable` traits.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self)`

Initialize an empty linked list.

Notes: Time Complexity: O(1).

`__init__(out self, owned *elements: ElementType)`

Initialize a linked list with the given elements.

Notes: Time Complexity: O(n) in len(elements).
**Args:** * ​\*elements (`ElementType`): Variable number of elements to initialize the list with. `__init__(out self, *, owned elements: VariadicListMem[ElementType, origin, is_owned])` Construct a list from a `VariadicListMem`. Notes: Time Complexity: O(n) in len(elements). **Args:** * ​elements (`VariadicListMem[ElementType, origin, is_owned]`): The elements to add to the list. ### `__copyinit__` `__copyinit__(out self, other: Self)` Initialize this list as a copy of another list. Notes: Time Complexity: O(n) in len(elements). **Args:** * ​other (`Self`): The list to copy from. ### `__moveinit__` `__moveinit__(out self, owned other: Self)` Initialize this list by moving elements from another list. Notes: Time Complexity: O(1). **Args:** * ​other (`Self`): The list to move elements from. ### `__del__` `__del__(owned self)` Clean up the list by freeing all nodes. Notes: Time Complexity: O(n) in len(self). ### `__bool__` `__bool__(self) -> Bool` Check if the list is non-empty. Notes: Time Complexity: O(1). **Returns:** True if the list has elements, False otherwise. ### `__getitem__` `__getitem__[I: Indexer](ref self, index: I) -> ref [self] ElementType` Get the element at the specified index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​index (`I`): The index of the element to get. **Returns:** The element at the specified index. ### `__setitem__` `__setitem__[I: Indexer](mut self, index: I, owned value: ElementType)` Set the element at the specified index. Notes: Time Complexity: O(n) in len(self). **Parameters:** * ​I (`Indexer`): The type of index to use. **Args:** * ​index (`I`): The index of the element to set. * ​value (`ElementType`): The new value to set. ### `__eq__` `__eq__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], other: LinkedList[ElementType]) -> Bool` Checks if the two lists are equal. Notes: Time Complexity: O(n) in min(len(self), len(other)) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​other (`LinkedList[ElementType]`): The list to compare to. **Returns:** Whether the lists are equal. ### `__ne__` `__ne__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], other: LinkedList[ElementType]) -> Bool` Checks if the two lists are not equal. Notes: Time Complexity: O(n) in min(len(self), len(other)) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​other (`LinkedList[ElementType]`): The list to compare to. **Returns:** Whether the lists are not equal. ### `__contains__` `__contains__[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], value: ElementType) -> Bool` Checks if the list contains `value`. Notes: Time Complexity: O(n) in len(self) compares. **Parameters:** * ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function. **Args:** * ​value (`ElementType`): The value to search for in the list. **Returns:** Whether the list contains `value`. ### `append` `append(mut self, owned value: ElementType)` Add an element to the end of the list. Notes: Time Complexity: O(1). **Args:** * ​value (`ElementType`): The value to append. 
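As a quick usage sketch (assuming `LinkedList` is importable from the `collections` package), `append()` and `prepend()` (described next) grow the list at either end:

```mojo
from collections import LinkedList

def demo_append_prepend():
    var names = LinkedList[String]("b", "c")
    names.append("d")    # names is now b, c, d
    names.prepend("a")   # names is now a, b, c, d
    print(len(names))    # => 4
```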
### `prepend`

`prepend(mut self, owned value: ElementType)`

Add an element to the beginning of the list.

Notes: Time Complexity: O(1).

**Args:**

* ​value (`ElementType`): The value to prepend.

### `reverse`

`reverse(mut self)`

Reverse the order of elements in the list.

Notes: Time Complexity: O(n) in len(self).

### `pop`

`pop(mut self) -> ElementType`

Remove and return the last element of the list.

Notes: Time Complexity: O(1).

**Returns:**

The last element in the list.

`pop[I: Indexer](mut self, owned i: I) -> ElementType`

Remove the ith element of the list, counting from the tail if given a negative index.

Notes: Time Complexity: O(1).

**Parameters:**

* ​I (`Indexer`): The type of index to use.

**Args:**

* ​i (`I`): The index of the element to get.

**Returns:**

Ownership of the indicated element.

### `maybe_pop`

`maybe_pop(mut self) -> Optional[ElementType]`

Removes the head of the list and returns it, if it exists.

Notes: Time Complexity: O(1).

**Returns:**

The head of the list, if it was present.

`maybe_pop[I: Indexer](mut self, owned i: I) -> Optional[ElementType]`

Remove the ith element of the list, counting from the tail if given a negative index.

Notes: Time Complexity: O(1).

**Parameters:**

* ​I (`Indexer`): The type of index to use.

**Args:**

* ​i (`I`): The index of the element to get.

**Returns:**

The element, if it was found.

### `clear`

`clear(mut self)`

Removes all elements from the list.

Notes: Time Complexity: O(n) in len(self).

### `copy`

`copy(self) -> Self`

Create a deep copy of the list.

Notes: Time Complexity: O(n) in len(self).

**Returns:**

A new list containing copies of all elements.

### `insert`

`insert[I: Indexer](mut self, idx: I, owned elem: ElementType)`

Insert an element `elem` into the list at index `idx`.

Notes: Time Complexity: O(1).

**Parameters:**

* ​I (`Indexer`): The type of index to use.

**Args:**

* ​idx (`I`): The index to insert `elem` at; must satisfy `-len(self) <= idx <= len(self)`.
* ​elem (`ElementType`): The item to insert into the list.

**Raises:**

When given an out of bounds index.

### `extend`

`extend(mut self, owned other: Self)`

Extends the list with another.

Notes: Time Complexity: O(1).

**Args:**

* ​other (`Self`): The list to append to this one.

### `count`

`count[ElementType: EqualityComparable & Copyable & Movable, //](self: LinkedList[ElementType], elem: ElementType) -> UInt`

Count the occurrences of `elem` in the list.

Notes: Time Complexity: O(n) in len(self) compares.

**Parameters:**

* ​ElementType (`EqualityComparable & Copyable & Movable`): The list element type, used to conditionally enable the function.

**Args:**

* ​elem (`ElementType`): The element to search for.

**Returns:**

The number of occurrences of `elem` in the list.

### `__len__`

`__len__(self) -> Int`

Get the number of elements in the list.

Notes: Time Complexity: O(1).

**Returns:**

The number of elements in the list.

### `__iter__`

`__iter__(self) -> _LinkedListIter[ElementType, self]`

Iterate over elements of the list, returning immutable references.

Notes: Time Complexity:

* O(1) for iterator construction.
* O(n) in len(self) for a complete iteration of the list.

**Returns:**

An iterator of immutable references to the list elements.

### `__reversed__`

`__reversed__(self) -> _LinkedListIter[ElementType, self, False]`

Iterate backwards over the list, returning immutable references.

Notes: Time Complexity:

* O(1) for iterator construction.
* O(n) in len(self) for a complete iteration of the list.

**Returns:**

A reversed iterator of immutable references to the list elements.
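Here's a minimal iteration sketch (assuming the builtin `reversed()` dispatches to `__reversed__()`, and that iterator elements are dereferenced with `[]`, as in the manual's other collection examples):

```mojo
from collections import LinkedList

def iterate_values():
    var values = LinkedList[Int](1, 2, 3)
    for v in values:
        print(v[])   # prints 1, 2, 3
    for v in reversed(values):
        print(v[])   # prints 3, 2, 1
```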
### `__str__`

`__str__[ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType]) -> String`

Convert the list to its string representation.

Notes: Time Complexity: O(n) in len(self).

**Parameters:**

* ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`.

**Returns:**

String representation of the list.

### `__repr__`

`__repr__[ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType]) -> String`

Convert the list to its string representation.

Notes: Time Complexity: O(n) in len(self).

**Parameters:**

* ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`.

**Returns:**

String representation of the list.

### `write_to`

`write_to[W: Writer, ElementType: Copyable & Movable & Writable](self: LinkedList[ElementType], mut writer: W)`

Write the list to the given writer.

Notes: Time Complexity: O(n) in len(self).

**Parameters:**

* ​W (`Writer`): The type of writer to write the list to.
* ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function when `ElementType` is `Writable`.

**Args:**

* ​writer (`W`): The writer to write the list to.

---

## list

Defines the List type. These APIs are imported automatically, just like builtins.

## Structs

* [​`List`](/mojo/stdlib/collections/list/List): The `List` type is a dynamically-allocated list.

---

## List

`struct List[T: Copyable & Movable, hint_trivial_type: Bool = False]`

The `List` type is a dynamically-allocated list.

Notes: It supports pushing and popping from the back, resizing the underlying storage as needed. When it is deallocated, it frees its memory.

## Parameters

* ​T (`Copyable & Movable`): The type of the elements.
* ​hint\_trivial\_type (`Bool`): A hint to the compiler that the type T is trivial. It's not mandatory, but if set, it allows some optimizations.

## Fields

* ​data (`UnsafePointer[T]`): The underlying storage for the list.
* ​capacity (`Int`): The amount of elements that can fit in the list without resizing it.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self)`

Constructs an empty list.

`__init__(out self, *, capacity: Int)`

Constructs a list with the given capacity.

**Args:**

* ​capacity (`Int`): The requested capacity of the list.

`__init__(out self, *, length: UInt, fill: T)`

Constructs a list of the given length, filling each element with the given value.

**Args:**

* ​length (`UInt`): The requested length of the list.
* ​fill (`T`): The element to fill each element of the list.

`__init__(out self, owned *values: T, *, __list_literal__: Tuple[] = Tuple())`

Constructs a list from the given values.

**Args:**

* ​\*values (`T`): The values to populate the list with.
* ​`__list_literal__` (`Tuple[]`): Tell Mojo to use this method for list literals.

`__init__(out self, *, owned elements: VariadicListMem[T, origin, is_owned])`

Constructs a list from the given values.

**Args:**

* ​elements (`VariadicListMem[T, origin, is_owned]`): The values to populate the list with.

`__init__(out self, span: Span[T, origin])`

Constructs a list from a Span of values.

**Args:**

* ​span (`Span[T, origin]`): The span of values to populate the list with.

`__init__(out self, *, unsafe_uninit_length: Int)`

Construct a list with the specified length, with uninitialized memory.
This is unsafe, as it relies on the caller initializing the elements with unsafe operations, not assigning over the uninitialized data. **Args:** * ​unsafe\_uninit\_length (`Int`): The number of elements to allocate. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a deepcopy of the given list. **Args:** * ​existing (`Self`): The list to copy. ### `__del__` `__del__(owned self)` Destroy all elements in the list and free its memory. ### `__bool__` `__bool__(self) -> Bool` Checks whether the list has any elements or not. **Returns:** `False` if the list is empty, `True` if there is at least one element. ### `__getitem__` `__getitem__(self, slice: Slice) -> Self` Gets the sequence of elements at the specified positions. **Args:** * ​slice (`Slice`): A slice that specifies positions of the new list. **Returns:** A new list containing the list at the specified slice. `__getitem__[I: Indexer](ref self, idx: I) -> ref [self] T` Gets the list element at the given index. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index of the element. **Returns:** A reference to the element at the given index. ### `__eq__` `__eq__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], other: List[U, hint_trivial_type]) -> Bool` Checks if two lists are equal. Examples: ```mojo var x = List[Int](1, 2, 3) var y = List[Int](1, 2, 3) print("x and y are equal" if x == y else "x and y are not equal") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​other (`List[U, hint_trivial_type]`): The list to compare with. **Returns:** True if the lists are equal, False otherwise. ### `__ne__` `__ne__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], other: List[U, hint_trivial_type]) -> Bool` Checks if two lists are not equal. Examples: ```mojo var x = List[Int](1, 2, 3) var y = List[Int](1, 2, 4) print("x and y are not equal" if x != y else "x and y are equal") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​other (`List[U, hint_trivial_type]`): The list to compare with. **Returns:** True if the lists are not equal, False otherwise. ### `__contains__` `__contains__[U: EqualityComparable & Copyable & Movable, //](self: List[U, hint_trivial_type], value: U) -> Bool` Verify if a given value is present in the list. Examples: ```mojo var x = List[Int](1,2,3) print("x contains 3" if 3 in x else "x does not contain 3") ``` **Parameters:** * ​U (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`. **Args:** * ​value (`U`): The value to find. **Returns:** True if the value is contained in the list, False otherwise. ### `__add__` `__add__(self, owned other: Self) -> Self` Concatenates self with other and returns the result as a new list. **Args:** * ​other (`Self`): List whose elements will be combined with the elements of self. **Returns:** The newly created list. ### `__mul__` `__mul__(self, x: Int) -> Self` Multiplies the list by x and returns a new list. **Args:** * ​x (`Int`): The multiplier number. **Returns:** The new list. ### `__iadd__` `__iadd__(mut self, owned other: Self)` Appends the elements of other into self. **Args:** * ​other (`Self`): List whose elements will be appended to self. 
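A short sketch of these operators together (illustrative values only):

```mojo
def list_operators():
    var a = List[Int](1, 2)
    var b = List[Int](3)
    var c = a + b          # [1, 2, 3] via __add__
    c += List[Int](4)      # [1, 2, 3, 4] via __iadd__
    var d = b * 3          # [3, 3, 3] via __mul__
    print(len(c), len(d))  # => 4 3
```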
### `__imul__`

`__imul__(mut self, x: Int)`

Appends the original elements of this list x-1 times, or clears it if x is <= 0.

**Args:**

* ​x (`Int`): The multiplier number.

### `copy`

`copy(self) -> Self`

Creates a deep copy of the given list.

**Returns:**

A copy of the value.

### `__iter__`

`__iter__(ref self) -> _ListIter[T, hint_trivial_type, self_is_origin]`

Iterate over elements of the list, returning immutable references.

**Returns:**

An iterator of immutable references to the list elements.

### `__reversed__`

`__reversed__(ref self) -> _ListIter[T, hint_trivial_type, self_is_origin, False]`

Iterate backwards over the list, returning immutable references.

**Returns:**

A reversed iterator of immutable references to the list elements.

### `__len__`

`__len__(self) -> Int`

Gets the number of elements in the list.

**Returns:**

The number of elements in the list.

### `__str__`

`__str__[U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type]) -> String`

Returns a string representation of a `List`.

Notes: Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below:

```mojo
var my_list = List[Int](1, 2, 3)
print(my_list.__str__())
```

When the compiler supports conditional methods, then a simple `String(my_list)` will be enough.

**Parameters:**

* ​U (`Representable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `Representable`.

**Returns:**

A string representation of the list.

### `write_to`

`write_to[W: Writer, U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type], mut writer: W)`

Write `my_list.__str__()` to a `Writer`.

**Parameters:**

* ​W (`Writer`): A type conforming to the Writable trait.
* ​U (`Representable & Copyable & Movable`): The type of the List elements. Must have the trait `Representable`.

**Args:**

* ​writer (`W`): The object to write to.

### `__repr__`

`__repr__[U: Representable & Copyable & Movable, //](self: List[U, hint_trivial_type]) -> String`

Returns a string representation of a `List`.

Notes: Note that since we can't condition methods on a trait yet, the way to call this method is a bit special. Here is an example below:

```mojo
var my_list = List[Int](1, 2, 3)
print(my_list.__repr__())
```

When the compiler supports conditional methods, then a simple `repr(my_list)` will be enough.

**Parameters:**

* ​U (`Representable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `Representable`.

**Returns:**

A string representation of the list.

### `byte_length`

`byte_length(self) -> Int`

Gets the byte length of the List (`len(self) * sizeof[T]()`).

**Returns:**

The byte length of the List (`len(self) * sizeof[T]()`).

### `append`

`append(mut self, owned value: T)`

Appends a value to this list.

Notes: If there is no capacity left, resizes to twice the current capacity. Except for 0 capacity where it sets 1.

**Args:**

* ​value (`T`): The value to append.

`append(mut self, elements: Span[T, origin])`

Appends elements to this list.

**Args:**

* ​elements (`Span[T, origin]`): The elements to append.

### `insert`

`insert(mut self, i: Int, owned value: T)`

Inserts a value to the list at the given index. `a.insert(len(a), value)` is equivalent to `a.append(value)`.

**Args:**

* ​i (`Int`): The index for the value.
* ​value (`T`): The value to insert.

### `extend`

`extend(mut self, owned other: List[T, hint_trivial_type])`

Extends this list by consuming the elements of `other`.
**Args:**

* ​other (`List[T, hint_trivial_type]`): List whose elements will be added in order at the end of this list.

`extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: SIMD[D, size])`

Extends this list with the elements of a vector.

Notes: If there is no capacity left, resizes to `len(self) + value.size`.

**Parameters:**

* ​D (`DType`): The DType.

**Args:**

* ​value (`SIMD[D, size]`): The value to append.

`extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: SIMD[D, size], *, count: Int)`

Extends this list with `count` number of elements from a vector.

Notes: If there is no capacity left, resizes to `len(self) + count`.

**Parameters:**

* ​D (`DType`): The DType.

**Args:**

* ​value (`SIMD[D, size]`): The value to append.
* ​count (`Int`): The amount of items to append. Must be less than or equal to `value.size`.

`extend[D: DType, //](mut self: List[SIMD[D, 1], hint_trivial_type], value: Span[SIMD[D, 1], origin])`

Extends this list with the elements of a `Span`.

Notes: If there is no capacity left, resizes to `len(self) + len(value)`.

**Parameters:**

* ​D (`DType`): The DType.

**Args:**

* ​value (`Span[SIMD[D, 1], origin]`): The value to append.

### `pop`

`pop(mut self, i: Int = -1) -> T`

Pops a value from the list at the given index.

**Args:**

* ​i (`Int`): The index of the value to pop.

**Returns:**

The popped value.

### `reserve`

`reserve(mut self, new_capacity: Int)`

Reserves the requested capacity.

Notes: If the current capacity is greater or equal, this is a no-op. Otherwise, the storage is reallocated and the data is moved.

**Args:**

* ​new\_capacity (`Int`): The new capacity.

### `resize`

`resize(mut self, new_size: Int, value: T)`

Resizes the list to the given new size.

Notes: If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the list is appended with new values elements up to the requested size.

**Args:**

* ​new\_size (`Int`): The new size.
* ​value (`T`): The value to use to populate new elements.

`resize(mut self, *, unsafe_uninit_length: Int)`

Resizes the list to the given new size leaving any new elements uninitialized. If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the list is extended and the new elements are left uninitialized.

**Args:**

* ​unsafe\_uninit\_length (`Int`): The new size.

### `shrink`

`shrink(mut self, new_size: Int)`

Resizes to the given new size, which must be less than or equal to the current size.

**Args:**

* ​new\_size (`Int`): The new size.

### `reverse`

`reverse(mut self)`

Reverses the elements of the list.

### `index`

`index[C: EqualityComparable & Copyable & Movable, //](ref self: List[C, hint_trivial_type], value: C, start: Int = 0, stop: Optional[Int] = Optional(None)) -> Int`

Returns the index of the first occurrence of a value in a list restricted by the range given the start and stop bounds.

Examples:

```mojo
var my_list = List[Int](1, 2, 3)
print(my_list.index(2))  # prints `1`
```

**Parameters:**

* ​C (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the `EqualityComparable` trait.

**Args:**

* ​value (`C`): The value to search for.
* ​start (`Int`): The starting index of the search, treated as a slice index (defaults to 0).
* ​stop (`Optional[Int]`): The ending index of the search, treated as a slice index (defaults to None, which means the end of the list).

**Returns:**

The index of the first occurrence of the value in the list.

**Raises:**

ValueError: If the value is not found in the list.
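For example, a minimal sketch of handling the raising behavior:

```mojo
def find_in_list():
    var items = List[Int](10, 20, 30)
    try:
        print(items.index(20))  # => 1
        _ = items.index(99)     # raises, since 99 isn't in the list
    except e:
        print("not found:", e)
```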
### `clear`

`clear(mut self)`

Clears the elements in the list.

### `steal_data`

`steal_data(mut self) -> UnsafePointer[T]`

Take ownership of the underlying pointer from the list.

**Returns:**

The underlying data.

### `unsafe_get`

`unsafe_get(ref self, idx: Int) -> ref [self] T`

Get a reference to an element of self without checking index bounds.

Notes: Users should consider using `__getitem__` instead of this method as it is unsafe. If an index is out of bounds, this method will not abort, it will be considered undefined behavior. Note that there is no wraparound for negative indices, caution is advised. Using negative indices is considered undefined behavior. Never use `my_list.unsafe_get(-1)` to get the last element of the list. Instead, do `my_list.unsafe_get(len(my_list) - 1)`.

**Args:**

* ​idx (`Int`): The index of the element to get.

**Returns:**

A reference to the element at the given index.

### `unsafe_set`

`unsafe_set(mut self, idx: Int, owned value: T)`

Write a value to a given location without checking index bounds.

Notes: Users should consider using `my_list[idx] = value` instead of this method as it is unsafe. If an index is out of bounds, this method will not abort, it will be considered undefined behavior. Note that there is no wraparound for negative indices, caution is advised. Using negative indices is considered undefined behavior. Never use `my_list.unsafe_set(-1, value)` to set the last element of the list. Instead, do `my_list.unsafe_set(len(my_list) - 1, value)`.

**Args:**

* ​idx (`Int`): The index of the element to set.
* ​value (`T`): The value to set.

### `count`

`count[T: EqualityComparable & Copyable & Movable, //](self: List[T, hint_trivial_type], value: T) -> Int`

Counts the number of occurrences of a value in the list.

**Parameters:**

* ​T (`EqualityComparable & Copyable & Movable`): The type of the elements in the list. Must implement the trait `EqualityComparable`.

**Args:**

* ​value (`T`): The value to count.

**Returns:**

The number of occurrences of the value in the list.

### `swap_elements`

`swap_elements(mut self, elt_idx_1: Int, elt_idx_2: Int)`

Swaps elements at the specified indexes if they are different.

Examples:

```mojo
var my_list = List[Int](1, 2, 3)
my_list.swap_elements(0, 2)
print(my_list.__str__())  # 3, 2, 1
```

Notes: This is useful because `swap(my_list[i], my_list[j])` cannot be supported by Mojo, because a mutable alias may be formed.

**Args:**

* ​elt\_idx\_1 (`Int`): The index of one element.
* ​elt\_idx\_2 (`Int`): The index of the other element.

### `unsafe_ptr`

`unsafe_ptr(ref self) -> UnsafePointer[T, mut=self_is_mut, origin=self_is_origin]`

Retrieves a pointer to the underlying memory.

**Returns:**

The pointer to the underlying memory.

---

## listdir

`listdir[PathLike: PathLike](path: PathLike) -> List[String]`

Gets the list of entries contained in the path provided.

**Parameters:**

* ​PathLike (`PathLike`): A type conforming to the os.PathLike trait.

**Args:**

* ​path (`PathLike`): The path to the directory.

**Returns:**

The list of entries in the path provided.
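A minimal usage sketch (assuming `listdir` is importable from the `os` package):

```mojo
from os import listdir

def print_entries():
    # listdir returns a List[String]; iterate it and dereference
    # each element reference with [].
    var entries = listdir(".")
    for entry in entries:
        print(entry[])
```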
--- ## load `load[type: DType, //, width: Int = 1, *, read_only: Bool = False, prefetch_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), cache_policy: CacheOperation = CacheOperation(0), eviction_policy: CacheEviction = CacheEviction(0), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1](ptr: UnsafePointer[SIMD[type, 1]]) -> SIMD[type, width]` Loads data from global memory into a SIMD vector. Provides a high-level interface for vectorized memory loads with configurable cache behavior and memory access patterns. **Parameters:** * ​type (`DType`): The data type to load. * ​width (`Int`): Vector width (number of elements to load). * ​read\_only (`Bool`): If True, marks the load as read-only for cache optimization. * ​prefetch\_size (`OptionalReg[Int]`): Optional L2 cache prefetch size (64, 128, or 256 bytes). * ​cache\_policy (`CacheOperation`): Cache operation policy for the load. * ​eviction\_policy (`CacheEviction`): Cache eviction policy. * ​alignment (`Int`): Memory alignment in bytes. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1]]`): Pointer to global memory to load from. **Returns:** SIMD vector containing the loaded data. `load[OffsetType: Indexer, type: DType, //, width: Int = 1, *, read_only: Bool = False, prefetch_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), cache_policy: CacheOperation = CacheOperation(0), eviction_policy: CacheEviction = CacheEviction(0), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_nvidia_gpu() else 1](ptr: UnsafePointer[SIMD[type, 1]], offset: OffsetType) -> SIMD[type, width]` Loads data from global memory with an offset into a SIMD vector. Provides a high-level interface for vectorized memory loads with configurable cache behavior and memory access patterns, supporting offset-based addressing. **Parameters:** * ​OffsetType (`Indexer`): Type of the offset value. * ​type (`DType`): The data type to load. * ​width (`Int`): Vector width (number of elements to load). * ​read\_only (`Bool`): If True, marks the load as read-only for cache optimization. * ​prefetch\_size (`OptionalReg[Int]`): Optional L2 cache prefetch size (64, 128, or 256 bytes). * ​cache\_policy (`CacheOperation`): Cache operation policy for the load. * ​eviction\_policy (`CacheEviction`): Cache eviction policy. * ​alignment (`Int`): Memory alignment in bytes. **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1]]`): Base pointer to global memory. * ​offset (`OffsetType`): Offset from base pointer in elements. **Returns:** SIMD vector containing the loaded data. --- ## load_acquire `load_acquire[type: DType, //, *, scope: Scope = Scope(6), memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, 1]` Performs an atomic load operation with acquire memory ordering semantics. This function provides a memory barrier that ensures no subsequent memory operations from the calling thread are executed until after this load completes. Note: * Only supported on GPUs. * Maps directly to PTX ld.acquire instruction on NVIDIA, LLVM atomic load on AMDGPU.
* Ensures subsequent memory operations don't execute until after load. * Critical for implementing synchronization primitives. **Parameters:** * ​type (`DType`): The data type to load. * ​scope (`Scope`): Memory scope for the operation (default: Scope.SYSTEM). * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. **Returns:** The loaded value. --- ## load_matrix_a `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 4]` Loads a tile of matrix A from memory to registers for TF32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 TF32 values loaded from matrix A in the required order. `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 4]` Loads a tile of matrix A from memory to registers for FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 FP16 values loaded from matrix A in the required order. `load_matrix_a[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, k // 2]` Loads a tile of matrix A from memory to registers for BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8 or m=16, n=8, k=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing k//2 BF16 values loaded from matrix A in the required order. --- ## load_matrix_a_amd `load_matrix_a_amd[m: Int, n: Int, k: Int](a_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 1]` Loads a tile of matrix A from memory to registers for AMD FP32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=4.
**Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 1 FP32 value loaded from matrix A. `load_matrix_a_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](a_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 4]` Loads a tile of matrix A from memory to registers for AMD FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=16 and n\_blocks=1 or m=4, n=4, k=4 and n\_blocks=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 FP16 values loaded from matrix A. `load_matrix_a_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](a_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, 4]` Loads a tile of matrix A from memory to registers for AMD BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=16, k=16 and n\_blocks=1 or m=4, n=4, k=4 and n\_blocks=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​a\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix A data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix A (stride between rows). **Returns:** SIMD vector containing 4 BF16 values loaded from matrix A. --- ## load_matrix_b `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 2]` Loads a tile of matrix B from memory to registers for TF32 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 2 TF32 values loaded from matrix B in the required order. `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float16, 2]` Loads a tile of matrix B from memory to registers for FP16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8. 
**Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 2 FP16 values loaded from matrix B in the required order. `load_matrix_b[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[bfloat16, k // 4]` Loads a tile of matrix B from memory to registers for BF16 tensor core operations. **Constraints:** The tile dimensions must be m=16, n=8, k=8 or m=16, n=8, k=16. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing k//4 BF16 values loaded from matrix B in the required order. --- ## load_matrix_b_amd `load_matrix_b_amd[m: Int, n: Int, k: Int](b_ptr: UnsafePointer[SIMD[float32, 1]], tile_row: Int, tile_col: Int, ldm: Int) -> SIMD[float32, 1]` Loads a tile of matrix B from memory to registers for AMD FP32 tensor core operations. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float32, 1]]`): Pointer to matrix B data in memory. * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). **Returns:** SIMD vector containing 1 FP32 value loaded from matrix B. `load_matrix_b_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](b_ptr: UnsafePointer[SIMD[float16, 1]], tile_row: Int, tile_col: Int, ldm: Int, tile_loops: Int = 1) -> SIMD[float16, 4]` Loads a tile of matrix B from memory to registers for AMD FP16 tensor core operations. This function loads 4 consecutive FP16 values per thread from matrix B in a pattern optimized for AMD GPU tensor core operations. Each thread loads values based on its position within the warp. Performance: * Optimized for AMD GPU memory access patterns. * Uses thread ID to determine which elements to load. * Loads 4 consecutive elements per thread for efficient vectorization. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[float16, 1]]`): Pointer to matrix B data in memory (FP16 format). * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). * ​tile\_loops (`Int`): Number of tile loops across matrix B's row dimension.
**Returns:** SIMD vector containing 4 FP16 values loaded from matrix B. `load_matrix_b_amd[m: Int, n: Int, k: Int, n_blocks: Int = 1](b_ptr: UnsafePointer[SIMD[bfloat16, 1]], tile_row: Int, tile_col: Int, ldm: Int, tile_loops: Int = 1) -> SIMD[bfloat16, 4]` Loads a tile of matrix B from memory to registers for AMD BF16 tensor core operations. This function loads 4 consecutive BF16 values per thread from matrix B in a pattern optimized for AMD GPU tensor core operations. Each thread loads values based on its position within the warp. Performance: * Optimized for AMD GPU memory access patterns. * Uses thread ID to determine which elements to load. * Loads 4 consecutive elements per thread for efficient vectorization. **Parameters:** * ​m (`Int`): Number of rows in the output matrix tile. * ​n (`Int`): Number of columns in the output matrix tile. * ​k (`Int`): Inner dimension for matrix multiplication. * ​n\_blocks (`Int`): Number of blocks. **Args:** * ​b\_ptr (`UnsafePointer[SIMD[bfloat16, 1]]`): Pointer to matrix B data in memory (BF16 format). * ​tile\_row (`Int`): Starting row index of the tile. * ​tile\_col (`Int`): Starting column index of the tile. * ​ldm (`Int`): Leading dimension of matrix B (stride between rows). * ​tile\_loops (`Int`): Number of tile loops across matrix B's row dimension. **Returns:** SIMD vector containing 4 BF16 values loaded from matrix B. --- ## load_volatile `load_volatile[type: DType, //, memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[type, 1]` Performs a volatile load operation that cannot be optimized away. This function guarantees that the load operation will be performed exactly as specified, without being reordered or optimized away by the compiler. Note: * Only supported on NVIDIA GPUs. * Maps directly to PTX ld.volatile instruction. * Prevents compiler optimization of the load operation. * Useful for memory-mapped I/O or synchronization primitives. * May have performance implications compared to regular loads. **Parameters:** * ​type (`DType`): The data type to load. * ​memory (`Bool`): Whether to include memory side effects in constraints (default: True). **Args:** * ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to load from. **Returns:** The loaded value. 
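The volatile and acquire loads documented above are typically used for flag polling between GPU threads. Below is a minimal sketch of a spin wait built on `load_volatile`; the `gpu.intrinsics` import path and the kernel context are assumptions for illustration, not confirmed by this reference.

```mojo
# Sketch only: assumes `load_volatile` is importable from `gpu.intrinsics`
# and that this function runs inside a GPU kernel on an NVIDIA device.
from gpu.intrinsics import load_volatile

fn spin_until_set(flag: UnsafePointer[Int32]):
    # Re-reads the flag on every iteration: the volatile load cannot be
    # hoisted out of the loop or cached in a register by the compiler.
    while load_volatile(flag) == 0:
        pass
```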
--- ## load_z `load_z[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)` --- ## LoadStore_i8mm `struct LoadStore_i8mm[type: DType, simd_size: Int, single_row: Bool, tile_rows: Int, tile_columns: Int]` ## Fields * ​output\_tile (`_Accumulator[type, tile_rows, num_simd_cols, simd_size]`): * ​skip\_boundary\_check (`Bool`): ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `num_simd_cols` `alias num_simd_cols = 0 if simd_size == 0 else tile_columns // simd_size` ## Methods ### `__init__` `@implicit` `__init__(out self, skip_boundary_check: Bool)` --- ## lock ## Structs * [​`BlockingScopedLock`](/mojo/stdlib/utils/lock/BlockingScopedLock): A scope adapter for BlockingSpinLock. * [​`BlockingSpinLock`](/mojo/stdlib/utils/lock/BlockingSpinLock): A basic locking implementation that uses an integer to represent the owner of the lock. * [​`SpinWaiter`](/mojo/stdlib/utils/lock/SpinWaiter): A proxy for the C++ runtime's SpinWaiter type. --- ## log `log[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise natural log (base E) of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): Vector to perform logarithm operation on. **Returns:** Vector containing result of performing natural log base E on x. --- ## log_probabilities ## `compute_log_probabilities_ragged()` {#max.pipelines.lib.log_probabilities.compute_log_probabilities_ragged} > max.pipelines.lib.log\_probabilities.compute\_log\_probabilities\_ragged(\*, input\_row\_offsets, logits, next\_token\_logits, tokens, sampled\_tokens, batch\_top\_n, batch\_echo) Computes the log probabilities for ragged model outputs. **Parameters:** * **input\_row\_offsets** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – Token offsets into token-indexed buffers, by batch index. Should have 1 more element than there are batches (batch n is token indices \[input\_row\_offsets\[n], input\_row\_offsets\[n+1])). * **logits** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) – (tokens, vocab\_dim) tensor of logits. Token dimension mapped to batches using input\_row\_offsets. * **next\_token\_logits** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – (batch, vocab\_dim) tensor of logits for next tokens per batch. * **sampled\_tokens** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – (batch\_dim,) tensor of sampled tokens per batch. * **batch\_top\_n** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – Number of top log probabilities to return per input in the batch.
For any element where top\_n == 0, the LogProbabilities is skipped. * **batch\_echo** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`bool`](https://docs.python.org/3/library/functions.html#bool) `]` ) – Whether to include input tokens in the returned log probabilities. * **tokens** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) **Returns:** Computed log probabilities for each item in the batch. **Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*LogProbabilities*](core.md#max.pipelines.core.LogProbabilities) | None] ## `log_softmax()` {#max.pipelines.lib.log_probabilities.log_softmax} > max.pipelines.lib.log\_probabilities.log\_softmax(x, axis=-1) Compute the logarithm of the softmax function. This implementation uses the identity log(softmax(x)) = x - log(sum(exp(x))) with numerical stability improvements to prevent overflow/underflow. **Parameters:** * **x** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – Input array * **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Axis to compute values along **Returns:** Array with same shape as x, representing log(softmax(x)) **Return type:** [*ndarray*](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) --- ## log10 `log10[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `log10` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `log10` of the input. --- ## log1p `log1p[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `log1p` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `log1p` of the input. --- ## log2 `log2[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise log (base 2) of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): Vector to perform logarithm operation on. **Returns:** Vector containing result of performing log base 2 on x. --- ## log2_floor `log2_floor(val: Int) -> Int` Returns the floor of the base-2 logarithm of an integer value. **Args:** * ​val (`Int`): The input value. **Returns:** The floor of the base-2 logarithm of the input value, which is equal to the position of the highest set bit. Returns -1 if val is 0. --- ## logb `logb[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `logb` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `logb` of the input. --- ## logger Provides logging functionality with different severity levels. 
## Modules * [​`logger`](/mojo/stdlib/logger/logger/): Provides logging functionality with different severity levels. --- ## logger Provides logging functionality with different severity levels. This module implements a simple logging system with configurable severity levels: `NOTSET`, `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. The logging level can be set via the LOGGING\_LEVEL environment variable. The main components are: * `Level`: An enum-like struct defining the available logging levels * `Logger`: A struct that handles logging messages with different severity levels Example: ```mojo from logger import Logger var logger = Logger() # Uses default level from LOGGING_LEVEL env var logger.info("Starting process") logger.debug("Debug information") logger.error("An error occurred") ``` The logger can be configured to write to different file descriptors (default stdout). Messages below the configured level will be silently ignored. ## Aliases ### `DEFAULT_LEVEL` `alias DEFAULT_LEVEL = _from_str[::Bool,::Origin[$0]](env_get_string[::StringSlice[::Bool())` ## Structs * [​`Level`](/mojo/stdlib/logger/logger/Level): Represents logging severity levels. * [​`Logger`](/mojo/stdlib/logger/logger/Logger): A logger that outputs messages at or above a specified severity level. --- ## Logger `struct Logger[level: Level = _from_str[::Bool,::Origin[$0]](env_get_string[::StringSlice[::Bool())]` A logger that outputs messages at or above a specified severity level. ## Parameters * ​level (`Level`): The minimum severity level for messages to be logged. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, fd: FileDescriptor = FileDescriptor(1))` Initializes a new Logger. **Args:** * ​fd (`FileDescriptor`): The file descriptor to write log messages to (defaults to stdout). ### `debug` `debug[*Ts: Writable](self, *values: *Ts)` Logs a debug message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `info` `info[*Ts: Writable](self, *values: *Ts)` Logs an info message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `warning` `warning[*Ts: Writable](self, *values: *Ts)` Logs a warning message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `error` `error[*Ts: Writable](self, *values: *Ts)` Logs an error message. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. ### `critical` `critical[*Ts: Writable](self, *values: *Ts)` Logs a critical message and aborts execution. **Parameters:** * ​\*Ts (`Writable`): The types of values to log. **Args:** * ​\*values (`*Ts`): The values to log. --- ## logical_divide `logical_divide(layout_a: Layout, _layout_b: Layout) -> Layout` Divides a layout into blocks according to another layout. This function creates a hierarchical layout by dividing the first layout according to the second layout. It's useful for creating blocked or tiled representations of tensors. **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​\_layout\_b (`Layout`): The layout defining the division pattern. **Returns:** A new layout representing the hierarchical division. `logical_divide(layout_a: Layout, tiler: List[Layout]) -> Layout` Divides a layout into blocks according to a list of layouts. 
This is a variant of logical\_divide that works with a list of layouts for more complex tiling patterns. **Args:** * ​layout\_a (`Layout`): The layout to be divided. * ​tiler (`List[Layout]`): A list of layouts defining the division patterns. **Returns:** A new layout representing the hierarchical division. --- ## logical_product `logical_product(_layout_a: Layout, layout_b: Layout) -> Layout` Creates a product of two layouts. This function creates a hierarchical layout by taking the logical product of two layouts. It's a fundamental operation for creating blocked or tiled layouts. **Args:** * ​\_layout\_a (`Layout`): The first layout. * ​layout\_b (`Layout`): The second layout. **Returns:** A new layout representing the logical product of the two layouts. `logical_product(layout_a: Layout, tiler: List[Layout]) -> Layout` Creates a product of a layout with a list of layouts. This is a variant of logical\_product that works with a list of layouts for more complex tiling patterns. It applies the logical\_product operation to each element of the layout with the corresponding element in the tiler list. Example:

```mojo
from layout import Layout, LayoutList, IntTuple
from layout.layout import logical_product

# Create a product of a layout with a list of layouts
var base = Layout.row_major(6, 8)
var tilers = LayoutList()
tilers.append(Layout(IntTuple(2, 2)))
var result = logical_product(base, tilers)
```

**Args:** * ​layout\_a (`Layout`): The base layout to create products with. * ​tiler (`List[Layout]`): A list of layouts defining the product patterns. **Returns:** A new layout representing the logical product with the tiler layouts. --- ## logsoftmax `logsoftmax[simd_width: Int, buffer_size: Dim, type: DType, origins: origin.set, input_fn_1d: fn[Int](Int) capturing -> SIMD[type, $0]](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched logsoftmax on an input tensor using the three-pass algorithm. The unbatched three-pass logsoftmax is defined as:

```
procedure SoftmaxUnbatched(Input)
  maxVal = -∞
  accum = 0
  STEP 1: find the max value in each batch
  for i = 0 to N do
    maxVal = max(maxVal, Input[b, i])
  end for
  STEP 2: compute the sum of exponentials of each batch
  for i = 0 to N do
    Output[b, i] = Input[b, i] - maxVal
    accum += exp(Output[b, i])
  end for
  STEP 3: normalize each batch
  for i = 0 to N do
    Output[b, i] -= log(accum)
  end for
```

**Parameters:** * ​simd\_width (`Int`): The simd\_width to use in vectorization. * ​buffer\_size (`Dim`): The size of the input and output buffers. * ​type (`DType`): The type of the input and output buffers. * ​origins (`origin.set`): The OriginSet of captured arguments by the input\_fn\_1d. * ​input\_fn\_1d (`fn[Int](Int) capturing -> SIMD[type, $0]`): The elementwise input lambda. **Args:** * ​output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values.
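The three passes are easier to follow in scalar form. The following self-contained sketch (plain Mojo over a small list, not the vectorized kernel API above) applies the same max-shift and log-sum normalization:

```mojo
from math import exp, log

def main():
    var x = List[Float64](1.0, 2.0, 3.0)
    # STEP 1: find the max value for numerical stability.
    var max_val = x[0]
    for i in range(1, len(x)):
        if x[i] > max_val:
            max_val = x[i]
    # STEP 2: shift by the max and accumulate the exponentials.
    var accum: Float64 = 0.0
    var out = List[Float64]()
    for i in range(len(x)):
        out.append(x[i] - max_val)
        accum += exp(out[i])
    # STEP 3: normalize by subtracting log of the accumulated sum.
    for i in range(len(x)):
        out[i] -= log(accum)
        print(out[i])  # log-softmax of x
```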
`logsoftmax[: origin.set, //, type: DType, simd_width: Int, rank: Int, static_shape: DimList, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]](shape: IndexList[rank], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` `logsoftmax[type: DType, simd_width: Int, rank: Int, static_shape: DimList](input: NDBuffer[type, rank, origin, static_shape], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` --- ## lookup_py_type_object `lookup_py_type_object[T: TypeIdentifiable]() -> TypedPythonObject[__init__[__mlir_type.!kgen.string]("Type")]` Retrieves a reference to the unique Python type describing Python objects containing Mojo values of type `T`. This function looks up the Python type object that was previously registered for the Mojo type `T` using a `PythonTypeBuilder`. The returned type object can be used to create Python objects that wrap Mojo values of type `T`. **Parameters:** * ​T (`TypeIdentifiable`): The Mojo type to look up. Must implement the `TypeIdentifiable` trait to provide a unique type identifier. **Returns:** A `TypedPythonObject["Type"]` representing the Python type object that binds the Mojo type `T` to the current CPython interpreter instance. **Raises:** If no `PythonTypeBuilder` was ever finalized for type `T`, or if no Python type object has been registered for the provided type identifier. --- ## lop `lop[lut: SIMD[int32, 1]](a: SIMD[int32, 1], b: SIMD[int32, 1], c: SIMD[int32, 1]) -> SIMD[int32, 1]` Performs an arbitrary logical operation on 3 inputs using a lookup table. Implements a 3-input lookup table (LUT) operation. The result is determined by bits in the lookup table value for each input combination. Note: * Only supported on NVIDIA GPUs. * Maps to the LOP3.B32 PTX instruction. * Lookup table value determines output for each possible input combo. **Parameters:** * ​lut (`SIMD[int32, 1]`): 32-bit lookup table value that defines the logical operation. **Args:** * ​a (`SIMD[int32, 1]`): First input value. * ​b (`SIMD[int32, 1]`): Second input value. * ​c (`SIMD[int32, 1]`): Third input value. **Returns:** Result of applying the lookup table operation to the inputs. --- ## lstat `lstat[PathLike: PathLike](path: PathLike) -> stat_result` Gets the status of a file or a file descriptor (similar to stat, but does not follow symlinks). **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. **Returns:** The `stat_result` for the path. --- ## mac16 `mac16(gpr: Int)` SI16 matrix multiply and add. --- ## Magic changelog This is the change history for the [`magic` CLI](/magic). You can check which version you have with this command:

```sh
magic --version
```

You can update to the latest version with this:

```sh
magic self-update
```

## v0.7.2 (2025-03-14) * Fixed a build regression that affected Ubuntu 22.04 compatibility. * Enhanced the `init --from` command to initialize projects from recipes. ## v0.7.1 (2025-03-11) * Added support for the `init --from` command to initialize projects from recipes. ## v0.7.0 (2025-02-19) * Many small improvements and optimizations to `magic global install` and `magic global update` which should make installing `max-pipelines` faster. * Update to [pixi 0.41.3](https://github.com/prefix-dev/pixi/releases/tag/v0.41.3) * Update to [uv v0.5.29](https://github.com/astral-sh/uv/releases/tag/0.5.29) ## v0.6.4 (2025-01-27) * Fix bug with magic init with a folder that didn't exist.
## v0.6.3 (2025-01-24) * Add max-nightly channel as a default search channel. * Update to [pixi v0.40.3](https://github.com/prefix-dev/pixi/releases/tag/v0.40.3) ## v0.6.2 (2025-01-11) * Fix for an error warning about a missing /etc/magic/config.toml file on some linux distros. ## v0.6.1 (2025-01-10) * Significant performance improvements for package solving and resolution * Fix for a bug that caused `magic` to hang with some network configurations * Update to [pixi v0.40.0](https://github.com/prefix-dev/pixi/releases/tag/v0.40.0) ## v0.5.1 (2024-12-12) * Minor bug fix release for macOS. Remove an unintended runtime dependency on a library in homebrew. ## v0.5.0 (2024-12-03) * Expose `magic auth` to allow for authentication to private channels * Expose `magic upload` to allow for uploading packages to conda channels * Update to [pixi 0.37.0](https://github.com/prefix-dev/pixi/releases/tag/v0.37.0) * Update to uv 0.4.30 to fix minor bugs with installing some pypi packages ## v0.4.0 (2024-10-24) * Updating to [pixi 0.33](https://github.com/prefix-dev/pixi/releases/tag/v0.33.0) * Fix for magic search failing outside of a project [Issue Link](https://github.com/modular/modular/issues/209) ## v0.3.1 (2024-10-03) * Fixes a certification error when fetching some packages * Fixes for telemetry data * Added controls to disable telemetry for `magic` * Print the `pixi` version with the `magic --version` command ## v0.3.0 (2024-09-20) * Updating to [pixi 0.29](https://pixi.sh/latest/CHANGELOG/#0290-2024-09-04) * Telemetry improvements ## v0.2.3 (2024-09-05) First Magic release! 🪄 Based on [Pixi 0.27.1](https://pixi.sh/latest/CHANGELOG/#0271-2024-08-09). --- ## Magic commands # Magic commands This document contains the help content for the `magic` command-line program. ## `magic` magic - A high level package management tool by Modular for developing with Mojo and MAX. To get started, run `magic init` in your project directory. To see all available commands, run `magic --help` or `magic help`. **Usage:** `magic [OPTIONS] ` ###### **Subcommands:** * `init` — Initialize a new Magic project * `add` — Adds dependencies to the project * `remove` — Removes dependencies from the project * `install` — Install all dependencies * `update` — Update dependencies as recorded in the local lock file * `upgrade` — Update the version of packages to the latest possible version, disregarding the manifest version constraints * `lock` — Solve environment and update the lock file * `run` — Runs task in project * `exec` — Run a command in a temporary environment * `shell` — Start a shell in the magic environment of the project * `shell-hook` — Print the magic environment activation script * `project` — Modify the project configuration file through the command line * `task` — Interact with tasks in the project * `list` — List project's packages * `tree` — Show a tree of project dependencies * `global` — Subcommand for global package management actions * `auth` — Login to prefix.dev or anaconda.org servers to access private channels * `config` — Configuration management * `info` — Information about the system, project and environments for the current machine * `upload` — Upload a conda package * `search` — Search a conda package * `self-update` — Update magic to the latest or a specific version * `clean` — Clean the parts of your system which are touched by magic. Defaults to cleaning the environments and task cache. 
Use the `cache` subcommand to clean the cache * `completion` — Generates a completion script for a shell * `telemetry` — Configure how telemetry data is emitted from magic and Modular packages * `build` — Build a project * `8ball` — Ask the 8-ball a question ###### **Options:** * `-v`, `--verbose` — Increase logging verbosity * `-q`, `--quiet` — Decrease logging verbosity * `--color ` — Whether the log needs to be colored Default value: `auto` Possible values: `always`, `never`, `auto` * `--no-progress` — Hide all progress bars Default value: `false` ## `magic init` Initialize a new Magic project **Usage:** `magic init [OPTIONS] [PATH]` ###### **Arguments:** * `` — Where to place the project (defaults to current path) Default value: `.` ###### **Options:** * `-c`, `--channel ` — Channels to use in the project * `-p`, `--platform ` — Platforms that the project supports * `-i`, `--import ` — Environment.yml file to bootstrap the project * `--format ` — The manifest format to create Possible values: `magic`, `pyproject`, `mojoproject` * `-s`, `--scm ` — Source Control Management used for this project Possible values: `github`, `gitlab`, `codeberg` * `--from ` — Initialize a project using a template/recipe. To find new recipes, see the modular/max-recipes repo on GitHub. A project can be initialized from a URL to a zip file, a GitHub repo, or a released recipe in a GitHub repo. * Initialize with a released GitHub recipe: The format for a GitHub recipe is `[owner/repo/]recipe[@version]`. If an owner/repo is not provided explicitly, then modular/max-recipes is the default. This searches for a released recipe.zip file in the GitHub repo's releases. * Initialize with a GitHub repo: A full GitHub repo (without history) can be used to initialize into the target folder with the format `owner/repo[@tag or branch][/recipe]`. A specified recipe is optional; if specified and found in the repo, only that recipe subfolder will be extracted. * Initialize with a ZIP archive: Any https\://, http\://, and file:// URL can also be passed in explicitly. This will download and extract the full zip into the target project folder. Examples: * `magic init --from max-serve-open-webui` - Download modular/max-recipes @ HEAD and extract the max-serve-open-webui folder * `magic init --from modular/max-recipes/max-serve-open-webui` - Same, but explicitly states the owner and repo. * `magic init --from modular/max-recipes/max-serve-open-webui@0.0.1` - Download a GitHub release from modular/max-recipes looking for max-serve-open-webui @ 0.0.1 * `magic init --from https://github.com/modular/modular/archive/refs/heads/main.zip` - Download and extract the zip file at a given URL. * `magic init --from modular/max` - Download the entire modular/max repo at main HEAD without git history * `magic init --from modular/max@stable` - Download the entire modular/max repo at the "stable" git tag or branch ref without git history * `magic init --from modular/max/examples` - Download the entire modular/max repo at main HEAD without git history and extract the example folder Note: this feature currently does not work for version tags, branch names, or recipes that contain a "/" character. WARNING: this will clobber any existing files in the target folder if it already exists. * `--run ` — Additional run command arguments to pass to the recipe after initialization. These are the same arguments that would be passed to `magic run` ## `magic add` Adds dependencies to the project The dependencies should be defined as a MatchSpec for a conda package, or as a PyPI requirement for the `--pypi` dependencies.
If no specific version is provided, the latest version compatible with your project will be chosen automatically or a \* will be used. Example usage: * `magic add python=3.10`: This will select the latest minor version that complies with 3.10.\*, i.e., python version 3.10.0, 3.10.1, 3.10.2, etc. * `magic add python`: In the absence of a specified version, the latest version will be chosen. For instance, this could resolve to python version 3.11.3.\* at the time of writing. Adding multiple dependencies at once is also supported: * `magic add python pytest`: This will add both `python` and `pytest` to the project's dependencies. The `--platform` and `--build/--host` flags make the dependency target specific. * `magic add python --platform linux-64 --platform osx-arm64`: Will add the latest version of python for linux-64 and osx-arm64 platforms. * `magic add python --build`: Will add the latest version of python as a build dependency. Mixing `--platform` and `--build`/`--host` flags is supported. The `--pypi` option will add the package as a pypi dependency. This cannot be mixed with the conda dependencies: * `magic add --pypi boto3` * `magic add --pypi "boto3==version"` If the project manifest is a `pyproject.toml`, adding a pypi dependency will add it to the native pyproject `project.dependencies` array or to the native `dependency-groups` table if a feature is specified: * `magic add --pypi boto3` will add `boto3` to the `project.dependencies` array * `magic add --pypi boto3 --feature aws` will add `boto3` to the `dependency-groups.aws` array Note that if `--platform` or `--editable` are specified, the pypi dependency will be added to the `tool.magic.pypi-dependencies` table instead, as native arrays have no support for platform-specific or editable dependencies. These dependencies will then be read by magic as if they had been added to the magic `pypi-dependencies` tables of the default or of a named feature. The versions will be automatically added with a pinning strategy based on semver or the pinning strategy set in the config. There is a list of packages that are not following the semver versioning scheme but will use the minor version by default: Python, Rust, Julia, GCC, GXX, GFortran, NodeJS, Deno, R, R-Base, Perl **Usage:** `magic add [OPTIONS] ...` ###### **Arguments:** * `` — The dependencies as names, conda MatchSpecs or PyPi requirements ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--host` — The specified dependencies are host dependencies. Conflicts with `build` and `pypi` * `--build` — The specified dependencies are build dependencies. Conflicts with `host` and `pypi` * `--pypi` — The specified dependencies are pypi dependencies.
Conflicts with `host` and `build` * `-p`, `--platform ` — The platform(s) for which the dependency should be modified * `-f`, `--feature ` — The feature for which the dependency should be modified Default value: `default` * `-g`, `--git ` — The git url to use when adding a git dependency * `--branch ` — The git branch * `--tag ` — The git tag * `--rev ` — The git revision * `-s`, `--subdir ` — The subdirectory of the git repository to use * `--no-lockfile-update` — Don't update lockfile, implies the no-install as well * `--frozen` — Install the environment as defined in the lockfile, doesn't update lockfile if it isn't up-to-date with the manifest file * `--locked` — Check if lockfile is up-to-date before installing the environment, aborts when lockfile isn't up-to-date with the manifest file * `--no-install` — Don't modify the environment, only modify the lock-file * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 * `--revalidate` — Run the complete environment validation. This will reinstall a broken environment * `--editable` — Whether the pypi requirement should be editable ## `magic remove` Removes dependencies from the project If the project manifest is a `pyproject.toml`, removing a pypi dependency with the `--pypi` flag will remove it from either - the native pyproject `project.dependencies` array or, if a feature is specified, the native `project.optional-dependencies` table - magic `pypi-dependencies` tables of the default feature or, if a feature is specified, a named feature **Usage:** `magic remove [OPTIONS] ...` ###### **Arguments:** * `` — The dependencies as names, conda MatchSpecs or PyPi requirements ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--host` — The specified dependencies are host dependencies. Conflicts with `build` and `pypi` * `--build` — The specified dependencies are build dependencies. Conflicts with `host` and `pypi` * `--pypi` — The specified dependencies are pypi dependencies. 
Conflicts with `host` and `build` * `-p`, `--platform ` — The platform(s) for which the dependency should be modified * `-f`, `--feature ` — The feature for which the dependency should be modified Default value: `default` * `-g`, `--git ` — The git url to use when adding a git dependency * `--branch ` — The git branch * `--tag ` — The git tag * `--rev ` — The git revision * `-s`, `--subdir ` — The subdirectory of the git repository to use * `--no-lockfile-update` — Don't update lockfile, implies the no-install as well * `--frozen` — Install the environment as defined in the lockfile, doesn't update lockfile if it isn't up-to-date with the manifest file * `--locked` — Check if lockfile is up-to-date before installing the environment, aborts when lockfile isn't up-to-date with the manifest file * `--no-install` — Don't modify the environment, only modify the lock-file * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 * `--revalidate` — Run the complete environment validation. This will reinstall a broken environment ## `magic install` Install all dependencies **Usage:** `magic install [OPTIONS]` ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--frozen` — Install the environment as defined in the lockfile, doesn't update lockfile if it isn't up-to-date with the manifest file * `--locked` — Check if lockfile is up-to-date before installing the environment, aborts when lockfile isn't up-to-date with the manifest file * `-e`, `--environment ` — The environment to install * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 * `-a`, `--all` ## `magic update` Update dependencies as recorded in the local lock file **Usage:** `magic update [OPTIONS] [PACKAGES]...` ###### **Arguments:** * `` — The packages to update ###### **Options:** * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--no-install` — Don't install the (solve) environments needed for pypi-dependencies solving * `-n`, `--dry-run` — Don't actually write the lockfile or update any environment * `-e`, `--environment ` — The environments to update. If none is specified, all environments are updated * `-p`, `--platform ` — The platforms to update. 
If none is specified, all platforms are updated * `--json` — Output the changes in JSON format ## `magic upgrade` Update the version of packages to the latest possible version, disregarding the manifest version constraints **Usage:** `magic upgrade [OPTIONS] [PACKAGES]...` ###### **Arguments:** * `` — The packages to upgrade ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--no-lockfile-update` — Don't update lockfile, implies the no-install as well * `--frozen` — Install the environment as defined in the lockfile, doesn't update lockfile if it isn't up-to-date with the manifest file * `--locked` — Check if lockfile is up-to-date before installing the environment, aborts when lockfile isn't up-to-date with the manifest file * `--no-install` — Don't modify the environment, only modify the lock-file * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 * `--revalidate` — Run the complete environment validation. This will reinstall a broken environment * `-f`, `--feature ` — The feature to update Default value: `default` * `--exclude ` — The packages which should be excluded * `--json` — Output the changes in JSON format * `-n`, `--dry-run` — Only show the changes that would be made, without actually updating the manifest, lock file, or environment ## `magic lock` Solve environment and update the lock file **Usage:** `magic lock [OPTIONS]` ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--json` — Output the changes in JSON format ## `magic run` Runs task in project **Usage:** `magic run [OPTIONS] [TASK]...` ###### **Arguments:** * `` — The magic task or a task shell command you want to run in the project's environment, which can be an executable in the environment's PATH ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--no-lockfile-update` — Don't update lockfile, implies the no-install as well * `--frozen` — Install the environment as defined in the lockfile, doesn't update lockfile if it isn't up-to-date with the manifest file * `--locked` — Check if lockfile is up-to-date before installing the environment, aborts when lockfile isn't up-to-date with the manifest file * `--no-install` — Don't modify the environment, only modify the lock-file * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 * `--revalidate` — Run the complete environment validation. This will reinstall a broken environment * `--force-activate` — Do not use the environment activation cache. 
(default: true except in experimental mode) * `-e`, `--environment ` — The environment to run the task in * `--clean-env` — Use a clean environment to run the task Using this flag will ignore your current shell environment and use bare minimum environment to activate the magic environment in. * `--skip-deps` — Don't run the dependencies of the task ('depends-on' field in the task definition) * `-n`, `--dry-run` — Run the task in dry-run mode (only print the command that would run) * `--help` Possible values: `true`, `false` * `-h` Possible values: `true`, `false` ## `magic exec` Run a command in a temporary environment **Usage:** `magic exec [OPTIONS] [COMMAND]...` ###### **Arguments:** * `` — The executable to run ###### **Options:** * `-s`, `--spec ` — Matchspecs of packages to install. If this is not provided, the package is guessed from the command * `-c`, `--channel ` — The channels to consider as a name or a url. Multiple channels can be specified by using this field multiple times. When specifying a channel, it is common that the selected channel also depends on the `conda-forge` channel. By default, if no channel is provided, `conda-forge` is used. * `-p`, `--platform ` — The platform to create the environment for Default value: `osx-arm64` * `--force-reinstall` — If specified a new environment is always created even if one already exists * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 ## `magic shell` Start a shell in the magic environment of the project **Usage:** `magic shell [OPTIONS]` ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--no-lockfile-update` — Don't update lockfile, implies the no-install as well * `--frozen` — Install the environment as defined in the lockfile, doesn't update lockfile if it isn't up-to-date with the manifest file * `--locked` — Check if lockfile is up-to-date before installing the environment, aborts when lockfile isn't up-to-date with the manifest file * `--no-install` — Don't modify the environment, only modify the lock-file * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 * `--revalidate` — Run the complete environment validation. This will reinstall a broken environment * `-e`, `--environment ` — The environment to activate in the shell * `--change-ps1 ` — Do not change the PS1 variable when starting a prompt Possible values: `true`, `false` * `--force-activate` — Do not use the environment activation cache. (default: true except in experimental mode) ## `magic shell-hook` Print the magic environment activation script. You can source the script to activate the environment without needing magic itself. 
**Usage:** `magic shell-hook [OPTIONS]` ###### **Options:** * `-s`, `--shell ` — Sets the shell, options: [`bash`, `zsh`, `xonsh`, `cmd`, `powershell`, `fish`, `nushell`] * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--no-lockfile-update` — Don't update lockfile, implies the no-install as well * `--frozen` — Install the environment as defined in the lockfile, doesn't update lockfile if it isn't up-to-date with the manifest file * `--locked` — Check if lockfile is up-to-date before installing the environment, aborts when lockfile isn't up-to-date with the manifest file * `--no-install` — Don't modify the environment, only modify the lock-file * `--tls-no-verify` — Do not verify the TLS certificate of the server * `--auth-file ` — Path to the file containing the authentication token * `--pypi-keyring-provider ` — Specifies if we want to use uv keyring provider Possible values: `disabled`, `subprocess` * `--concurrent-solves ` — Max concurrent solves, default is the number of CPUs * `--concurrent-downloads ` — Max concurrent network requests, default is 50 * `--revalidate` — Run the complete environment validation. This will reinstall a broken environment * `--force-activate` — Do not use the environment activation cache. (default: true except in experimental mode) * `-e`, `--environment ` — The environment to activate in the script * `--json` — Emit the environment variables set by running the activation as JSON Default value: `false` * `--change-ps1 ` — Do not change the PS1 variable when starting a prompt Possible values: `true`, `false` ## `magic project` Modify the project configuration file through the command line **Usage:** `magic project [OPTIONS] ` ###### **Subcommands:** * `channel` — Commands to manage project channels * `description` — Commands to manage project description * `platform` — Commands to manage project platforms * `version` — Commands to manage project version * `environment` — Commands to manage project environments * `export` — Commands to export projects to other formats * `name` — Commands to manage project name * `system-requirements` — Commands to manage project environments ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory ## `magic project channel` Commands to manage project channels **Usage:** `magic project channel ` ###### **Subcommands:** * `add` — Adds a channel to the project file and updates the lockfile * `list` — List the channels in the project file * `remove` — Remove channel(s) from the project file and updates the lockfile ## `magic project channel add` Adds a channel to the project file and updates the lockfile **Usage:** `magic project channel add [OPTIONS] ...` ###### **Arguments:** * `` — The channel name or URL ###### **Options:** * `--manifest-path ` — The path to `pixi.toml`, `pyproject.toml`, or the project directory * `--priority ` — Specify the channel priority * `--prepend` — Add the channel(s) to the beginning of the channels list, making them the highest priority * `--no-lockfile-update` — Don't update lockfile, implies the no-install as well * `--frozen` — Install the environment as defined in the lockfile, doesn't update lockfile if it isn't up-to-date with the manifest file * `--locked` — Check if lockfile is up-to-date before installing the environment, aborts when lockfile isn't up-to-date with the manifest file * `--no-install` — Don't modify the environment, only modify the lock-file * `--tls-no-verify` — Do not verify 
## `magic project`

Modify the project configuration file through the command line

**Usage:** `magic project [OPTIONS] <COMMAND>`

###### **Subcommands:**

* `channel` — Commands to manage project channels
* `description` — Commands to manage project description
* `platform` — Commands to manage project platforms
* `version` — Commands to manage project version
* `environment` — Commands to manage project environments
* `export` — Commands to export projects to other formats
* `name` — Commands to manage project name
* `system-requirements` — Commands to manage project system requirements

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic project channel`

Commands to manage project channels

**Usage:** `magic project channel <COMMAND>`

###### **Subcommands:**

* `add` — Adds a channel to the project file and updates the lockfile
* `list` — List the channels in the project file
* `remove` — Removes channel(s) from the project file and updates the lockfile

## `magic project channel add`

Adds a channel to the project file and updates the lockfile

**Usage:** `magic project channel add [OPTIONS] <CHANNEL>...`

###### **Arguments:**

* `<CHANNEL>` — The channel name or URL

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `--priority <PRIORITY>` — Specify the channel priority
* `--prepend` — Add the channel(s) to the beginning of the channels list, making them the highest priority
* `--no-lockfile-update` — Don't update the lockfile; implies `--no-install` as well
* `--frozen` — Install the environment as defined in the lockfile; doesn't update the lockfile if it isn't up-to-date with the manifest file
* `--locked` — Check if the lockfile is up-to-date before installing the environment; aborts when the lockfile isn't up-to-date with the manifest file
* `--no-install` — Don't modify the environment, only modify the lockfile
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50
* `--revalidate` — Run the complete environment validation. This will reinstall a broken environment
* `-f`, `--feature <FEATURE>` — The name of the feature to modify
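A hypothetical invocation that adds a channel at the highest priority without touching the installed environment (the channel name is illustrative):

```sh
# Prepend the bioconda channel and update only the lockfile.
magic project channel add bioconda --prepend --no-install
```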
## `magic project channel list`

List the channels in the project file

**Usage:** `magic project channel list [OPTIONS]`

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `--urls` — Display the channels' URLs instead of their names

## `magic project channel remove`

Removes channel(s) from the project file and updates the lockfile

**Usage:** `magic project channel remove [OPTIONS] <CHANNEL>...`

###### **Arguments:**

* `<CHANNEL>` — The channel name or URL

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `--priority <PRIORITY>` — Specify the channel priority
* `--prepend` — Add the channel(s) to the beginning of the channels list, making them the highest priority
* `--no-lockfile-update` — Don't update the lockfile; implies `--no-install` as well
* `--frozen` — Install the environment as defined in the lockfile; doesn't update the lockfile if it isn't up-to-date with the manifest file
* `--locked` — Check if the lockfile is up-to-date before installing the environment; aborts when the lockfile isn't up-to-date with the manifest file
* `--no-install` — Don't modify the environment, only modify the lockfile
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50
* `--revalidate` — Run the complete environment validation. This will reinstall a broken environment
* `-f`, `--feature <FEATURE>` — The name of the feature to modify

## `magic project description`

Commands to manage project description

**Usage:** `magic project description [OPTIONS] <COMMAND>`

###### **Subcommands:**

* `get` — Get the project description
* `set` — Set the project description

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic project description get`

Get the project description

**Usage:** `magic project description get`

## `magic project description set`

Set the project description

**Usage:** `magic project description set <DESCRIPTION>`

###### **Arguments:**

* `<DESCRIPTION>` — The project description

## `magic project platform`

Commands to manage project platforms

**Usage:** `magic project platform [OPTIONS] <COMMAND>`

###### **Subcommands:**

* `add` — Adds platform(s) to the project file and updates the lockfile
* `list` — List the platforms in the project file
* `remove` — Removes platform(s) from the project file and updates the lockfile

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic project platform add`

Adds platform(s) to the project file and updates the lockfile

**Usage:** `magic project platform add [OPTIONS] <PLATFORM>...`

###### **Arguments:**

* `<PLATFORM>` — The platform name(s) to add

###### **Options:**

* `--no-install` — Don't update the environment, only add changed packages to the lockfile
* `-f`, `--feature <FEATURE>` — The name of the feature to add the platform to

## `magic project platform list`

List the platforms in the project file

**Usage:** `magic project platform list`

## `magic project platform remove`

Removes platform(s) from the project file and updates the lockfile

**Usage:** `magic project platform remove [OPTIONS] <PLATFORM>...`

###### **Arguments:**

* `<PLATFORM>` — The platform name(s) to remove

###### **Options:**

* `--no-install` — Don't update the environment, only remove the platform(s) from the lockfile
* `-f`, `--feature <FEATURE>` — The name of the feature to remove the platform from

## `magic project version`

Commands to manage project version

**Usage:** `magic project version [OPTIONS] <COMMAND>`

###### **Subcommands:**

* `get` — Get the workspace version
* `set` — Set the workspace version
* `major` — Bump the workspace version to MAJOR
* `minor` — Bump the workspace version to MINOR
* `patch` — Bump the workspace version to PATCH

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic project version get`

Get the workspace version

**Usage:** `magic project version get`

## `magic project version set`

Set the workspace version

**Usage:** `magic project version set <VERSION>`

###### **Arguments:**

* `<VERSION>` — The new project version

## `magic project version major`

Bump the workspace version to MAJOR

**Usage:** `magic project version major`

## `magic project version minor`

Bump the workspace version to MINOR

**Usage:** `magic project version minor`

## `magic project version patch`

Bump the workspace version to PATCH

**Usage:** `magic project version patch`
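A sketch of a release flow using these subcommands (the version numbers are illustrative):

```sh
magic project version set 1.2.0   # pin an explicit version
magic project version patch       # bump 1.2.0 -> 1.2.1
magic project version get         # print the current version
```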
## `magic project environment`

Commands to manage project environments

**Usage:** `magic project environment [OPTIONS] <COMMAND>`

###### **Subcommands:**

* `add` — Adds an environment to the manifest file
* `list` — List the environments in the manifest file
* `remove` — Remove an environment from the manifest file

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic project environment add`

Adds an environment to the manifest file

**Usage:** `magic project environment add [OPTIONS] <NAME>`

###### **Arguments:**

* `<NAME>` — The name of the environment to add

###### **Options:**

* `-f`, `--feature <FEATURE>` — Features to add to the environment
* `--solve-group <SOLVE_GROUP>` — The solve-group to add the environment to
* `--no-default-feature` — Don't include the default feature in the environment. Default value: `false`
* `--force` — Update the manifest even if the environment already exists. Default value: `false`

## `magic project environment list`

List the environments in the manifest file

**Usage:** `magic project environment list`

## `magic project environment remove`

Remove an environment from the manifest file

**Usage:** `magic project environment remove <NAME>`

###### **Arguments:**

* `<NAME>` — The name of the environment to remove

## `magic project export`

Commands to export projects to other formats

**Usage:** `magic project export <COMMAND>`

###### **Subcommands:**

* `conda-explicit-spec` — Export project environment to a conda explicit specification file
* `conda-environment` — Export project environment to a conda environment.yaml file

## `magic project export conda-explicit-spec`

Export project environment to a conda explicit specification file

**Usage:** `magic project export conda-explicit-spec [OPTIONS] <OUTPUT_DIR>`

###### **Arguments:**

* `<OUTPUT_DIR>` — Output directory for rendered explicit environment spec files

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `-e`, `--environment <ENVIRONMENT>`
* `-p`, `--platform <PLATFORM>` — The platform to render. Can be repeated for multiple platforms. Defaults to all platforms available for selected environments
* `--ignore-pypi-errors` — Skip PyPI dependencies, which are not supported in the conda explicit spec file, instead of failing. Default value: `false`
* `--ignore-source-errors` — Skip source dependencies, which are not supported in the conda explicit spec file, instead of failing. Default value: `false`
* `--no-lockfile-update` — Don't update the lockfile; implies `--no-install` as well
* `--frozen` — Install the environment as defined in the lockfile; doesn't update the lockfile if it isn't up-to-date with the manifest file
* `--locked` — Check if the lockfile is up-to-date before installing the environment; aborts when the lockfile isn't up-to-date with the manifest file
* `--no-install` — Don't modify the environment, only modify the lockfile
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50
* `--revalidate` — Run the complete environment validation. This will reinstall a broken environment

## `magic project export conda-environment`

Export project environment to a conda environment.yaml file

**Usage:** `magic project export conda-environment [OPTIONS] [OUTPUT_PATH]`

###### **Arguments:**

* `[OUTPUT_PATH]` — Explicit path to export the environment to

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `-p`, `--platform <PLATFORM>` — The platform to render the environment file for. Defaults to the current platform
* `-e`, `--environment <ENVIRONMENT>` — The environment to render the environment file for. Defaults to the default environment
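A minimal sketch, assuming the default environment and the current platform:

```sh
# Write a conda-style environment file for the default environment.
magic project export conda-environment environment.yaml
```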
## `magic project name`

Commands to manage project name

**Usage:** `magic project name [OPTIONS] <COMMAND>`

###### **Subcommands:**

* `get` — Get the project name
* `set` — Set the project name

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic project name get`

Get the project name

**Usage:** `magic project name get`

## `magic project name set`

Set the project name

**Usage:** `magic project name set <NAME>`

###### **Arguments:**

* `<NAME>` — The project name

## `magic project system-requirements`

Commands to manage project system requirements

**Usage:** `magic project system-requirements [OPTIONS] <COMMAND>`

###### **Subcommands:**

* `add` — Adds a system requirement to the manifest file
* `list` — List the system requirements in the manifest file

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic project system-requirements add`

Adds a system requirement to the manifest file

**Usage:** `magic project system-requirements add [OPTIONS] <REQUIREMENT> <VERSION>`

###### **Arguments:**

* `<REQUIREMENT>` — The name of the system requirement to add. Possible values:
  * `linux`: The version of the Linux kernel (find with `uname -r`)
  * `cuda`: The version of the CUDA driver (find with `nvidia-smi`)
  * `macos`: The version of macOS (find with `sw_vers`)
  * `glibc`: The version of the glibc library (find with `ldd --version`)
  * `other-libc`: Non-glibc libc family and version (find with `ldd --version`)
* `<VERSION>` — The version of the requirement

###### **Options:**

* `--family <FAMILY>` — The libc family; this can only be specified for requirement `other-libc`
* `-f`, `--feature <FEATURE>` — The name of the feature to modify

## `magic project system-requirements list`

List the system requirements in the manifest file

**Usage:** `magic project system-requirements list [OPTIONS]`

###### **Options:**

* `--json`
* `-e`, `--environment <ENVIRONMENT>`

## `magic task`

Interact with tasks in the project

**Usage:** `magic task [OPTIONS] <COMMAND>`

###### **Subcommands:**

* `add` — Add a command to the project
* `remove` — Remove a command from the project
* `alias` — Alias another specific command
* `list` — List all tasks in the project

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic task add`

Add a command to the project

**Usage:** `magic task add [OPTIONS] <NAME> <COMMANDS>...`

###### **Arguments:**

* `<NAME>` — Task name
* `<COMMANDS>` — One or more commands to actually execute

###### **Options:**

* `--depends-on <DEPENDS_ON>` — Depends on these other commands
* `-p`, `--platform <PLATFORM>` — The platform for which the task should be added
* `-f`, `--feature <FEATURE>` — The feature for which the task should be added
* `--cwd <CWD>` — The working directory relative to the root of the project
* `--env <ENV>` — The environment variable to set; use `--env key=value` multiple times for more than one variable
* `--description <DESCRIPTION>` — A description of the task to be added
* `--clean-env` — Isolate the task from the shell environment, and only use the magic environment to run the task
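A hypothetical example that registers a build task (the task name and command are illustrative); once added, other tasks can reference it via `--depends-on build`:

```sh
magic task add build "mojo build main.mojo" --description "Compile the app"
```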
###### **Options:**

* `-p`, `--platform <PLATFORM>` — The platform for which the alias should be added
* `--description <DESCRIPTION>` — The description of the alias task

## `magic task list`

List all tasks in the project

**Usage:** `magic task list [OPTIONS]`

###### **Options:**

* `-s`, `--summary` — Tasks available for this machine per environment
* `-e`, `--environment <ENVIRONMENT>` — The environment the list should be generated for. If not specified, the default environment is used
* `--json` — List as JSON instead of a tree

## `magic list`

List the project's packages. Highlighted packages are explicit dependencies.

**Usage:** `magic list [OPTIONS] [REGEX]`

###### **Arguments:**

* `[REGEX]` — List only packages matching a regular expression

###### **Options:**

* `--platform <PLATFORM>` — The platform to list packages for. Defaults to the current platform
* `--json` — Whether to output in JSON format
* `--json-pretty` — Whether to output in pretty JSON format
* `--sort-by <SORT_BY>` — Sorting strategy. Default value: `name`. Possible values: `size`, `name`, `kind`
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `-e`, `--environment <ENVIRONMENT>` — The environment to list packages for. Defaults to the default environment
* `--no-lockfile-update` — Don't update the lockfile; implies `--no-install` as well
* `--frozen` — Install the environment as defined in the lockfile; doesn't update the lockfile if it isn't up-to-date with the manifest file
* `--locked` — Check if the lockfile is up-to-date before installing the environment; aborts when the lockfile isn't up-to-date with the manifest file
* `--no-install` — Don't modify the environment, only modify the lockfile
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50
* `--revalidate` — Run the complete environment validation. This will reinstall a broken environment
* `-x`, `--explicit` — Only list packages that are explicitly defined in the project
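For example, to find a specific dependency (the package name here is illustrative):

```sh
# Show explicit dependencies whose names match "max", sorted by size.
magic list --explicit --sort-by size max
```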
## `magic tree`

Show a tree of project dependencies.

Dependency names highlighted in green are directly specified in the manifest. Yellow version numbers are conda packages; blue version numbers are PyPI packages.

**Usage:** `magic tree [OPTIONS] [REGEX]`

###### **Arguments:**

* `[REGEX]` — List only packages matching a regular expression

###### **Options:**

* `-p`, `--platform <PLATFORM>` — The platform to list packages for. Defaults to the current platform
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `-e`, `--environment <ENVIRONMENT>` — The environment to list packages for. Defaults to the default environment
* `--no-lockfile-update` — Don't update the lockfile; implies `--no-install` as well
* `--frozen` — Install the environment as defined in the lockfile; doesn't update the lockfile if it isn't up-to-date with the manifest file
* `--locked` — Check if the lockfile is up-to-date before installing the environment; aborts when the lockfile isn't up-to-date with the manifest file
* `--no-install` — Don't modify the environment, only modify the lockfile
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50
* `--revalidate` — Run the complete environment validation. This will reinstall a broken environment
* `-i`, `--invert` — Invert the tree and show what depends on the package given in the regex argument

## `magic global`

Subcommand for global package management actions. Install packages on the user level.

Example:

* `magic global install my_package`
* `magic global remove my_package`

**Usage:** `magic global <COMMAND>`

###### **Subcommands:**

* `add` — Adds dependencies to an environment
* `edit` — Edit the global manifest file
* `install` — Installs the defined packages in a globally accessible location and exposes their command line applications
* `uninstall` — Uninstalls environments from the global environment
* `remove` — Removes dependencies from an environment
* `list` — Lists all packages previously installed into a globally accessible location via `magic global install`
* `sync` — Sync global manifest with installed environments
* `expose` — Interact with the exposure of binaries in the global environment
* `update` — Updates environments in the global environment

## `magic global add`

Adds dependencies to an environment

Example:

* `magic global add --environment python numpy`
* `magic global add --environment my_env pytest pytest-cov --expose pytest=pytest`

**Usage:** `magic global add [OPTIONS] --environment <ENVIRONMENT> <PACKAGES>...`

###### **Arguments:**

* `<PACKAGES>` — Specifies the packages that are to be added to the environment

###### **Options:**

* `-e`, `--environment <ENVIRONMENT>` — Specifies the environment that the dependencies need to be added to
* `--expose <EXPOSE>` — Add one or more mappings which describe which executables are exposed. The syntax is `exposed_name=executable_name`, for example `python3.10=python`. Alternatively, you can input only an executable name and `executable_name=executable_name` is assumed
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50

## `magic global edit`

Edit the global manifest file. Opens your editor to edit the global manifest file.

**Usage:** `magic global edit [EDITOR]`

###### **Arguments:**

* `[EDITOR]` — The editor to use; defaults to the `EDITOR` environment variable or `nano` on Unix and `notepad` on Windows

## `magic global install`

Installs the defined packages in a globally accessible location and exposes their command line applications.
Example:

* `magic global install starship nushell ripgrep bat`
* `magic global install jupyter --with polars`
* `magic global install --expose python3.8=python python=3.8`
* `magic global install --environment science --expose jupyter --expose ipython jupyter ipython polars`

**Usage:** `magic global install [OPTIONS] <PACKAGES>...`

###### **Arguments:**

* `<PACKAGES>` — Specifies the packages that are to be installed

###### **Options:**

* `-c`, `--channel <CHANNEL>` — The channels to consider, as a name or a URL. Multiple channels can be specified by using this field multiple times. When specifying a channel, it is common that the selected channel also depends on the `conda-forge` channel. By default, if no channel is provided, `conda-forge` is used
* `-p`, `--platform <PLATFORM>`
* `-e`, `--environment <ENVIRONMENT>` — Ensures that all packages will be installed in the same environment
* `--expose <EXPOSE>` — Add one or more mappings which describe which executables are exposed. The syntax is `exposed_name=executable_name`, for example `python3.10=python`. Alternatively, you can input only an executable name and `executable_name=executable_name` is assumed
* `--with <WITH>` — Add additional dependencies to the environment. Their executables will not be exposed
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50
* `-u`, `--force-reinstall` — Specifies that the packages should be reinstalled even if they are already installed

## `magic global uninstall`

Uninstalls environments from the global environment.

Example: `magic global uninstall magic-pack rattler-build`

**Usage:** `magic global uninstall [OPTIONS] <ENVIRONMENTS>...`

###### **Arguments:**

* `<ENVIRONMENTS>` — Specifies the environments that are to be removed

###### **Options:**

* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50

## `magic global remove`

Removes dependencies from an environment. Use `magic global uninstall` to remove the whole environment.

Example:

* `magic global remove --environment python numpy`

**Usage:** `magic global remove [OPTIONS] <PACKAGES>...`

###### **Arguments:**

* `<PACKAGES>` — Specifies the packages that are to be removed

###### **Options:**

* `-e`, `--environment <ENVIRONMENT>` — Specifies the environment that the dependencies need to be removed from
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50

## `magic global list`

Lists all packages previously installed into a globally accessible location via `magic global install`.

All environments:

* Yellow: the binaries that are exposed.
* Green: the packages that are explicit dependencies of the environment.
* Blue: the version of the installed package.
* Cyan: the name of the environment.

Per environment:

* Green: packages that are explicitly installed.

**Usage:** `magic global list [OPTIONS] [REGEX]`

###### **Arguments:**

* `[REGEX]` — List only packages matching a regular expression. Without regex syntax it acts like a `contains` filter

###### **Options:**

* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50
* `-e`, `--environment <ENVIRONMENT>` — The name of the environment to list
* `--sort-by <SORT_BY>` — Sorting strategy for the package table of an environment. Default value: `name`. Possible values: `size`, `name`

## `magic global sync`

Sync global manifest with installed environments

**Usage:** `magic global sync [OPTIONS]`

###### **Options:**

* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50

## `magic global expose`

Interact with the exposure of binaries in the global environment.

`magic global expose add python310=python3.10 --environment myenv` will expose the `python3.10` executable as `python310` from the environment `myenv`.

`magic global expose remove python310 --environment myenv` will remove the exposed name `python310` from the environment `myenv`.

**Usage:** `magic global expose <COMMAND>`

###### **Subcommands:**

* `add` — Add exposed binaries from an environment to your global environment
* `remove` — Remove exposed binaries from the global environment

## `magic global expose add`

Add exposed binaries from an environment to your global environment

Example:

* `magic global expose add python310=python3.10 python3=python3 --environment myenv`
* `magic global add --environment my_env pytest pytest-cov --expose pytest=pytest`

**Usage:** `magic global expose add [OPTIONS] --environment <ENVIRONMENT> [MAPPINGS]...`

###### **Arguments:**

* `[MAPPINGS]` — Add one or more mappings which describe which executables are exposed. The syntax is `exposed_name=executable_name`, for example `python3.10=python`. Alternatively, you can input only an executable name and `executable_name=executable_name` is assumed
###### **Options:**

* `-e`, `--environment <ENVIRONMENT>` — The environment to which the binaries should be exposed
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50

## `magic global expose remove`

Remove exposed binaries from the global environment.

`magic global expose remove python310 python3 --environment myenv` will remove the exposed names `python310` and `python3` from the environment `myenv`.

**Usage:** `magic global expose remove [OPTIONS] [EXPOSED_NAMES]...`

###### **Arguments:**

* `[EXPOSED_NAMES]` — The exposed names that should be removed

###### **Options:**

* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50

## `magic global update`

Updates environments in the global environment

**Usage:** `magic global update [OPTIONS] [ENVIRONMENTS]...`

###### **Arguments:**

* `[ENVIRONMENTS]` — Specifies the environments that are to be updated

###### **Options:**

* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50

## `magic auth`

Login to prefix.dev or anaconda.org servers to access private channels

**Usage:** `magic auth <COMMAND>`

###### **Subcommands:**

* `login` — Store authentication information for a given host
* `logout` — Remove authentication information for a given host

## `magic auth login`

Store authentication information for a given host

**Usage:** `magic auth login [OPTIONS] <HOST>`

###### **Arguments:**

* `<HOST>` — The host to authenticate with (e.g. `repo.prefix.dev`)
###### **Options:**

* `--token <TOKEN>` — The token to use (for authentication with prefix.dev)
* `--username <USERNAME>` — The username to use (for basic HTTP authentication)
* `--password <PASSWORD>` — The password to use (for basic HTTP authentication)
* `--conda-token <CONDA_TOKEN>` — The token to use for anaconda.org / quetz authentication
* `--s3-access-key-id <S3_ACCESS_KEY_ID>` — The S3 access key ID
* `--s3-secret-access-key <S3_SECRET_ACCESS_KEY>` — The S3 secret access key
* `--s3-session-token <S3_SESSION_TOKEN>` — The S3 session token
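A sketch with placeholder credentials (never store real tokens in scripts or shell history you share):

```sh
magic auth login repo.prefix.dev --token <YOUR_TOKEN>
magic auth login anaconda.org --conda-token <YOUR_CONDA_TOKEN>
```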
## `magic auth logout`

Remove authentication information for a given host

**Usage:** `magic auth logout <HOST>`

###### **Arguments:**

* `<HOST>` — The host to remove authentication for

## `magic config`

Configuration management

**Usage:** `magic config <COMMAND>`

###### **Subcommands:**

* `edit` — Edit the configuration file
* `list` — List configuration values
* `prepend` — Prepend a value to a list configuration key
* `append` — Append a value to a list configuration key
* `set` — Set a configuration value
* `unset` — Unset a configuration value

## `magic config edit`

Edit the configuration file

**Usage:** `magic config edit [OPTIONS] [EDITOR]`

###### **Arguments:**

* `[EDITOR]` — The editor to use; defaults to the `EDITOR` environment variable or `nano` on Unix and `notepad` on Windows

###### **Options:**

* `-l`, `--local` — Operation on project-local configuration
* `-g`, `--global` — Operation on global configuration
* `-s`, `--system` — Operation on system configuration
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic config list`

List configuration values

Example: `magic config list default-channels`

**Usage:** `magic config list [OPTIONS] [KEY]`

###### **Arguments:**

* `[KEY]` — Configuration key to show (all if not provided)

###### **Options:**

* `--json` — Output in JSON format
* `-l`, `--local` — Operation on project-local configuration
* `-g`, `--global` — Operation on global configuration
* `-s`, `--system` — Operation on system configuration
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic config prepend`

Prepend a value to a list configuration key

Example: `magic config prepend default-channels bioconda`

**Usage:** `magic config prepend [OPTIONS] <KEY> <VALUE>`

###### **Arguments:**

* `<KEY>` — Configuration key to set
* `<VALUE>` — Configuration value to prepend

###### **Options:**

* `-l`, `--local` — Operation on project-local configuration
* `-g`, `--global` — Operation on global configuration
* `-s`, `--system` — Operation on system configuration
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic config append`

Append a value to a list configuration key

Example: `magic config append default-channels bioconda`

**Usage:** `magic config append [OPTIONS] <KEY> <VALUE>`

###### **Arguments:**

* `<KEY>` — Configuration key to set
* `<VALUE>` — Configuration value to append

###### **Options:**

* `-l`, `--local` — Operation on project-local configuration
* `-g`, `--global` — Operation on global configuration
* `-s`, `--system` — Operation on system configuration
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic config set`

Set a configuration value

Example: `magic config set default-channels '["conda-forge", "bioconda"]'`

**Usage:** `magic config set [OPTIONS] <KEY> [VALUE]`

###### **Arguments:**

* `<KEY>` — Configuration key to set
* `[VALUE]` — Configuration value to set (the key will be unset if no value is provided)

###### **Options:**

* `-l`, `--local` — Operation on project-local configuration
* `-g`, `--global` — Operation on global configuration
* `-s`, `--system` — Operation on system configuration
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic config unset`

Unset a configuration value

Example: `magic config unset default-channels`

**Usage:** `magic config unset [OPTIONS] <KEY>`

###### **Arguments:**

* `<KEY>` — Configuration key to unset

###### **Options:**

* `-l`, `--local` — Operation on project-local configuration
* `-g`, `--global` — Operation on global configuration
* `-s`, `--system` — Operation on system configuration
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic info`

Information about the system, project and environments for the current machine

**Usage:** `magic info [OPTIONS]`

###### **Options:**

* `--extended` — Show cache and environment size
* `--json` — Whether to show the output as JSON or not
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory

## `magic upload`

Upload a conda package. With this command, you can upload a conda package to a channel.

Example: `magic upload my_package.conda`

Use `magic auth login` to authenticate with the server.

**Usage:** `magic upload <HOST> <PACKAGE_FILE>`

###### **Arguments:**

* `<HOST>` — The host + channel to upload to
* `<PACKAGE_FILE>` — The file to upload

## `magic search`

Search for a conda package. Its output lists the latest version of the package.

**Usage:** `magic search [OPTIONS] <PACKAGE>`

###### **Arguments:**

* `<PACKAGE>` — Name of the package to search for

###### **Options:**

* `-c`, `--channel <CHANNEL>` — The channels to consider, as a name or a URL. Multiple channels can be specified by using this field multiple times. When specifying a channel, it is common that the selected channel also depends on the `conda-forge` channel. By default, if no channel is provided, `conda-forge` is used
* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `-p`, `--platform <PLATFORM>` — The platform to search for; defaults to the current platform. Default value: `osx-arm64`
* `-l`, `--limit <LIMIT>` — Limit the number of search results

## `magic self-update`

Update magic to the latest or a specific version.

Note: If the magic binary is not found in the default location (e.g. `~/.modular/bin/magic`), magic won't update, to prevent breaking the current installation.

**Usage:** `magic self-update [OPTIONS]`

###### **Options:**

* `--version <VERSION>` — The version to downgrade or upgrade to. The latest version is used if not specified
* `--force` — Force the update even if the magic binary is not found in the default location

## `magic clean`

Clean the parts of your system which are touched by magic. Defaults to cleaning the environments and task cache.
Use the `cache` subcommand to clean the cache.

**Usage:** `magic clean [OPTIONS] [COMMAND]`

###### **Subcommands:**

* `cache` — Clean the caches on your system that are touched by magic

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `-e`, `--environment <ENVIRONMENT>` — The environment directory to remove
* `--activation-cache` — Only remove the activation cache

## `magic clean cache`

Clean the caches on your system that are touched by magic

**Usage:** `magic clean cache [OPTIONS]`

###### **Options:**

* `--pypi` — Clean only the PyPI-related cache
* `--conda` — Clean only the conda-related cache
* `--mapping` — Clean only the mapping cache
* `--exec` — Clean only the `exec` cache
* `--repodata` — Clean only the repodata cache
* `--tool` — Clean only the build backend tools cache
* `-y`, `--yes` — Answer yes to all questions

## `magic completion`

Generates a completion script for a shell

**Usage:** `magic completion --shell <SHELL>`

###### **Options:**

* `-s`, `--shell <SHELL>` — The shell to generate a completion script for. Possible values:
  * `bash`: Bourne Again SHell (bash)
  * `elvish`: Elvish shell
  * `fish`: Friendly Interactive SHell (fish)
  * `nushell`: Nushell
  * `powershell`: PowerShell
  * `zsh`: Z SHell (zsh)
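For example, to load completions into your current zsh session (add the same line to `~/.zshrc` to persist it):

```sh
eval "$(magic completion --shell zsh)"
```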
## `magic telemetry`

Configure how telemetry data is emitted from magic and Modular packages

**Usage:** `magic telemetry [OPTIONS]`

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `-e`, `--environment <ENVIRONMENT>` — The environment to control telemetry for Modular packages (the default environment, if unspecified)
* `--enable` — Enable telemetry
* `--disable` — Disable telemetry

## `magic build`

Build a project

**Usage:** `magic build [OPTIONS]`

###### **Options:**

* `--manifest-path <MANIFEST_PATH>` — The path to `pixi.toml`, `pyproject.toml`, or the project directory
* `--tls-no-verify` — Do not verify the TLS certificate of the server
* `--auth-file <AUTH_FILE>` — Path to the file containing the authentication token
* `--pypi-keyring-provider <PYPI_KEYRING_PROVIDER>` — Specifies whether to use the uv keyring provider. Possible values: `disabled`, `subprocess`
* `--concurrent-solves <CONCURRENT_SOLVES>` — Max concurrent solves, default is the number of CPUs
* `--concurrent-downloads <CONCURRENT_DOWNLOADS>` — Max concurrent network requests, default is 50
* `-t`, `--target-platform <TARGET_PLATFORM>` — The target platform to build for (defaults to the current platform). Default value: `osx-arm64`
* `-o`, `--output-dir <OUTPUT_DIR>` — The output directory to place the build artifacts. Default value: `.`

## `magic 8ball`

Ask the 8-ball a question

**Usage:** `magic 8ball [OPTIONS] <QUESTION>`

###### **Arguments:**

* `<QUESTION>` — The question to ask the 8-ball

###### **Options:**

* `-d`, `--debug` — Enable debug verbose output
* `-g`, `--generate` — Use the 'generate' command instead of 'serve'
* `-m`, `--model <MODEL>` — Model to use. Default value: `modularai/Llama-3.1-8B-Instruct-GGUF`
* `--start-server` — Start the server instead of using a pre-existing one (default)
* `-n`, `--no-start-server` — Do not start the server

---

## Magic FAQ

## Why did you create Magic?

We created Magic to simplify your developer experience with Mojo. When you're developing with MAX and Mojo, your code doesn't exist in a vacuum. Your project has dependencies and runtime requirements, such as specific Python versions, Python packages, and potentially other Mojo code. Previously, you might have required separate tools for managing Python toolchains, handling Python dependencies, and managing MAX/Mojo toolchains and dependencies. That could be four or more systems to create a consistent, reproducible build environment.

This is where Magic comes in. Magic is a package manager and virtual environment system that unifies all these dependency management tasks with one tool. It also works seamlessly with popular Python package repositories and tools, while still allowing us to customize the virtual environment, packaging, and build tools for the growing MAX/Mojo platform. Magic ensures that your builds are consistent, reproducible, and ready for production, no matter where they're deployed.

And, because we built Magic upon the already amazing [pixi](https://github.com/prefix-dev/pixi) tool, it provides a smooth experience that feels like magic. 🪄

[Install Magic now](/magic/#install-magic).

## Why not just use conda?

We love conda and all the tools in the conda ecosystem, but conda alone doesn't do all the things that we want to do for MAX and Mojo projects. So we were thrilled when we saw that [prefix.dev](https://prefix.dev) already built a tool that improves upon conda in all the ways that we wanted. Because Magic is just a small extension to their pixi tool, we suggest you read their explanations for why they built pixi:

* [Let's stop dependency hell](https://prefix.dev/blog/launching_pixi)
* [7 reasons to Switch from Conda to Pixi](https://prefix.dev/blog/pixi_a_fast_conda_alternative)
* [Pixi FAQ](https://pixi.sh/latest/FAQ/)

That said, the `max` package is a conda package, and you can also install it using other conda tools. For details, see how to [add MAX/Mojo in a conda project](/magic/conda).

## Why not just use pixi?

We have every intention of contributing changes to the [pixi project](https://github.com/prefix-dev/pixi). However, Mojo is still a very young language, MAX requires some unique environment settings, and we're still building and planning a lot of features for our build and packaging system. So it's simply too soon to contribute some of our changes to a project like pixi, and we currently can't make some of their features work with MAX/Mojo (`magic` is missing some commands available in `pixi`).

The pixi team has a much larger developer community that they must prioritize and support. Meanwhile, we're building a new language and new developer tools from the ground up and we need to iterate fast. Quite simply, our projects have different priorities right now.

That said, we have a very good relationship with the pixi team. They've been nothing but supportive and helpful in our endeavour, and we look forward to collaborating with them.

## Do I have to use Magic?

No. You can also install MAX and Mojo (the `max` package) [using other conda tools](/magic/conda) or—as of version 25.3—[with pip](/magic/pip). Although we now support installing with pip, we still recommend using `magic` or `conda` for Mojo development, because the Python wheel installed with `pip` currently doesn't include the Mojo LSP or debugger. So you'll have a better IDE experience with Mojo through `magic`.

## Is the Magic tool open sourced?

Not today. We worked really hard to get Magic released as quickly as possible, and properly open sourcing any software is also a lot of work. The `magic` code currently has nothing proprietary in it, so it should just be a matter of time.

## What's the alpha-numeric string in the Magic install URL?

The string appended to the `https://magic.modular.com` URL is a universally unique identifier (UUID) that helps us improve our tools and user experience.
Any data associated with this ID is anonymized and not linked to any personally identifiable information.

---

## make_buffer_resource

`make_buffer_resource[type: DType](gds_ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], num_records: Int = Int(UInt32.MAX)) -> SIMD[uint32, 4]`

Creates a 128-bit buffer resource descriptor for AMD GPU buffer operations.

This function constructs a 128-bit buffer resource descriptor used by AMD GPUs for buffer load/store operations. The descriptor contains information about the memory location, size, and access properties needed by the hardware to perform memory operations.

Notes:

* Only supported on AMD GPUs.
* The descriptor follows AMD's hardware-specific format:
  * Bits 0-63: Base address
  * Bits 64-95: Number of records (size)
  * Bits 96-127: Flags controlling access properties
* Used with `buffer_load` and `buffer_store` operations.
* Performance-critical for optimized memory access patterns on AMD GPUs.

Example:

```mojo
from gpu.intrinsics import make_buffer_resource

var ptr = UnsafePointer[Scalar[DType.float32]].alloc(1024)
var resource = make_buffer_resource[DType.float32](ptr, 1024)
# Use resource with buffer_load/buffer_store operations
```

**Parameters:**

* type (`DType`): The data type of elements in the buffer.

**Args:**

* gds\_ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Global memory base address pointer to the start of the buffer.
* num\_records (`Int`): Maximum number of records that can be accessed through this resource descriptor. Reads with offsets beyond this value return 0. Defaults to `UInt32.MAX` for the maximum possible range.

**Returns:**

A 128-bit buffer resource descriptor as a `SIMD[DType.uint32, 4]`.

---

## make_layout

`make_layout(*layouts: Layout) -> Layout`

Creates a composite layout by concatenating multiple layouts.

This function combines multiple layouts into a single layout by concatenating their shapes and strides. The resulting layout represents a hierarchical structure where each input layout becomes a component of the output layout.

Example:

```mojo
from layout import Layout, IntTuple
from layout.layout import make_layout

var layout1 = Layout(IntTuple(2, 3), IntTuple(3, 1))
var layout2 = Layout(IntTuple(4, 5), IntTuple(5, 1))
var combined = make_layout(layout1, layout2)
# Result: Layout with shape ((2, 3), (4, 5)) and stride ((3, 1), (5, 1))
```

**Args:**

* \*layouts (`Layout`): Variable number of `Layout` objects to combine.

**Returns:**

A new `Layout` with concatenated shapes and strides from the input layouts.

`make_layout(layout_a: Layout, layout_b: Layout) -> Layout`

Creates a composite layout from two layouts.

This is a specialized version of `make_layout` that takes exactly two layouts and combines them into a single layout. This function exists as a workaround for compiler limitations.

**Args:**

* layout\_a (`Layout`): The first layout to include in the composite.
* layout\_b (`Layout`): The second layout to include in the composite.

**Returns:**

A new `Layout` with concatenated shapes and strides from the input layouts.
---

## make_layout

`make_layout[l1: Layout, l2: Layout, /, *, linear_idx_type: DType = uint64](a: RuntimeLayout[l1, element_type=element_type, linear_idx_type=linear_idx_type], b: RuntimeLayout[l2, element_type=element_type, linear_idx_type=linear_idx_type]) -> RuntimeLayout[make_layout(l1, l2), element_type=element_type, linear_idx_type=linear_idx_type]`

Combine two runtime layouts into a single composite layout. This creates a new layout by concatenating the dimensions and strides of the input layouts.

**Parameters:**

* l1 (`Layout`): The static layout type of `a`.
* l2 (`Layout`): The static layout type of `b`.
* linear\_idx\_type (`DType`): The integer type of all indices calculated by the returned runtime layout.

**Args:**

* a (`RuntimeLayout[l1, element_type=element_type, linear_idx_type=linear_idx_type]`): The first `RuntimeLayout` to combine.
* b (`RuntimeLayout[l2, element_type=element_type, linear_idx_type=linear_idx_type]`): The second `RuntimeLayout` to combine.

**Returns:**

A new `RuntimeLayout` with dimensions from both input layouts.

---

## make_ldmatrix_swizzle

`make_ldmatrix_swizzle[type: DType, row_size: Int, log2_vector_width: Int = 0]() -> Swizzle`

Make a swizzle to avoid bank conflicts for `ldmatrix` ops.

Creates a swizzle pattern optimized for `ldmatrix` operations. Minimizes bank conflicts in shared memory for these operations. Calculates swizzle parameters based on data type and row size.

**Parameters:**

* type (`DType`): The data type of the elements.
* row\_size (`Int`): Size of each row in elements.
* log2\_vector\_width (`Int`): Log2 of the vector width (default: 0).

**Returns:**

A `Swizzle` object configured for `ldmatrix`.
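A minimal sketch of constructing such a swizzle; the import path is an assumption based on the `layout` package layout and may differ across MAX versions:

```mojo
# Assumed import path for the swizzle helpers.
from layout.swizzle import make_ldmatrix_swizzle

fn example():
    # Swizzle pattern for bfloat16 shared-memory tiles with 64-element rows.
    var sw = make_ldmatrix_swizzle[DType.bfloat16, 64]()
```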
* mode (`TensorMapSwizzle`): The swizzle mode to use (`TensorMapSwizzle` enum).

**Returns:**

A `Swizzle` object configured by the specified mode.

---

## makedirs

`makedirs[PathLike: PathLike](path: PathLike, mode: Int = 511, exist_ok: Bool = False)`

Creates a specified leaf directory along with any necessary intermediate directories that don't already exist.

**Parameters:**

* PathLike (`PathLike`): A type conforming to the `os.PathLike` trait.

**Args:**

* path (`PathLike`): The path to the directory.
* mode (`Int`): The mode to create the directory with (default `511`, i.e. `0o777`).
* exist\_ok (`Bool`): Ignore the error if `True` and the path exists (default `False`).
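For example, a small sketch (this mirrors Python's function of the same name):

```mojo
from os import makedirs

fn main() raises:
    # Creates build/, build/artifacts/, and build/artifacts/logs/ as needed;
    # exist_ok=True suppresses the error if the path already exists.
    makedirs("build/artifacts/logs", exist_ok=True)
```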
---

## MakeLayoutList

`MakeLayoutList(v0: Layout, v1: Layout) -> List[Layout]`

Creates a list containing two layouts. This is a convenience function for creating a `LayoutList` with two elements.

**Args:**

* v0 (`Layout`): The first layout to include in the list.
* v1 (`Layout`): The second layout to include in the list.

**Returns:**

A `LayoutList` containing the two provided layouts.

---

## MakeTileLayoutList

`MakeTileLayoutList[*tile_sizes: Int]() -> List[Layout]`

Creates a list of layouts for tiling operations. This function creates a list of simple layouts, each with a shape from the provided `tile_sizes` and a stride of 1. These layouts can be used for tiling operations.

**Parameters:**

* \*tile\_sizes (`Int`): Variable number of integer tile dimensions.

**Returns:**

A `LayoutList` containing layouts for each tile size.

---

## Mammoth

Mammoth (formerly referred to as MAX Inference Cluster) is a Kubernetes-native distributed AI serving tool that makes it easier to run and manage LLMs at scale, using MAX as a backend for optimal model performance. It's built on the [Modular Platform](/max/intro) and is designed to give you efficient use of your hardware with minimal configuration, even when running multiple models across thousands of nodes.

The Mammoth control plane automatically selects the best available hardware to meet performance targets when deploying a model, and supports both manual and automatic scaling. Mammoth's built-in router intelligently distributes traffic, taking into account hardware load, GPU memory, and caching states. You can deploy and serve multiple models simultaneously across different hardware types or versions without complex setup or duplication of infrastructure.

:::note
Mammoth is not yet generally available. [Get in touch](https://www.modular.com/company/talk-to-us) to learn about early access for enterprise teams.
:::

Spin up an inference cluster and deploy models at scale from your CLI.

## Become a design partner

Mammoth is currently only available through Modular's early access program, where we're actively partnering with select organizations as design partners. Design partners collaborate directly with Modular's engineering and product teams, gain early access to in-development features, and receive tailored guidance on integrating the Modular Platform into their existing generative AI workloads.

---

## managed_tensor_slice

Implements the `ManagedTensorSlice` type - a view of a tensor that doesn't own the underlying data. This type is used to build custom graph operations.

## Aliases

### `InputTensor`

`alias InputTensor = ManagedTensorSlice[IOSpec(), static_spec=?]`

### `InputVariadicTensors`

`alias InputVariadicTensors = VariadicTensors[?, ?, ?, IOSpec(), static_specs=?]`

### `OutputTensor`

`alias OutputTensor = ManagedTensorSlice[IOSpec(), static_spec=?]`

### `OutputVariadicTensors`

`alias OutputVariadicTensors = VariadicTensors[?, ?, ?, IOSpec(), static_specs=?]`

## Structs

* [`DynamicTensor`](/max/api/mojo/tensor/managed_tensor_slice/DynamicTensor):
* [`ManagedTensorSlice`](/max/api/mojo/tensor/managed_tensor_slice/ManagedTensorSlice): A view of a tensor that does not own the underlying allocated pointer. When the object lifetime ends it does not free the underlying pointer. Conversely, if a `ManagedTensorSlice` is created, it will not extend the life of the underlying pointer.
* [`VariadicTensors`](/max/api/mojo/tensor/managed_tensor_slice/VariadicTensors): A tuple-like container of tensors representing variadic arguments from the graph compiler.

## Functions

* [`foreach`](/max/api/mojo/tensor/managed_tensor_slice/foreach): Apply the function `func` to each element of the tensor slice.
* [`rebuild_mix_precision_static_tensor_specs_with_input_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_mix_precision_static_tensor_specs_with_input_lambda):
* [`rebuild_mix_precision_static_tensor_specs_with_output_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_mix_precision_static_tensor_specs_with_output_lambda):
* [`rebuild_static_tensor_specs_with_input_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_static_tensor_specs_with_input_lambda):
* [`rebuild_static_tensor_specs_with_output_lambda`](/max/api/mojo/tensor/managed_tensor_slice/rebuild_static_tensor_specs_with_output_lambda):
* [`trace_slice_arg`](/max/api/mojo/tensor/managed_tensor_slice/trace_slice_arg): Helper to stringify the type and shape of a kernel argument for tracing.

---

## managed_tensor_slice_to_ndbuffer

`managed_tensor_slice_to_ndbuffer[: DType, : Int, spec: StaticTensorSpec[$0, $1], //](tensor: ManagedTensorSlice[io_spec, static_spec=spec]) -> NDBuffer[type, rank, MutableAnyOrigin, spec.shape, spec.strides, alignment=spec.alignment, address_space=spec.address_space, exclusive=spec.exclusive]`

---

## ManagedTensorSlice

`@register_passable(trivial)`

`struct ManagedTensorSlice[mut: Bool, input: IO, type: DType, rank: Int, //, io_spec: IOSpec[mut, input], *, static_spec: StaticTensorSpec[type, rank]]`

A view of a tensor that does not own the underlying allocated pointer. When the object lifetime ends it does not free the underlying pointer. Conversely, if a `ManagedTensorSlice` is created, it will not extend the life of the underlying pointer. Therefore, the user must take care to keep the pointer alive until the last use of a `ManagedTensorSlice` instance.
This class is useful for writing custom operations where memory is managed by an external runtime like in MAX's inference stack.

## Implemented traits

`AnyType`, `Copyable`, `DevicePassable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `address_space`

`alias address_space = static_spec.address_space`

### `alignment`

`alias alignment = static_spec.alignment`

### `device_type`

`alias device_type = LayoutTensor[type, static_spec.to_layout(), MutableAnyOrigin]`

### `exclusive`

`alias exclusive = static_spec.exclusive`

## Methods

### `__init__`

`__init__(ptr: UnsafePointer[SIMD[type, 1]], slices: InlineArray[Slice, rank], slicer_spec: RuntimeTensorSpec[type, rank]) -> Self`

Initializes a ManagedTensorSlice from a pointer, an array of slices, and a tensor spec. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine.

`__init__(ptr: UnsafePointer[SIMD[type, 1]], shape: IndexList[rank]) -> Self`

Initializes a ManagedTensorSlice from a pointer and shape. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine.

`__init__(ptr: UnsafePointer[SIMD[type, 1]], shape: IndexList[rank], strides: IndexList[rank]) -> Self`

Initializes a ManagedTensorSlice from a pointer, shape, and strides. In general, custom operations should not create `ManagedTensorSlice` instances, but instead use the ones provided by the MAX inference engine.

### `__getitem__`

`__getitem__(self, indices: IndexList[rank]) -> SIMD[type, 1]`

Gets the value at the specified indices.

**Args:**

* indices (`IndexList[rank]`): The indices of the value to retrieve.

**Returns:**

The value at the specified indices.

`__getitem__(self, *indices: Int) -> SIMD[type, 1]`

Gets the value at the specified indices.

**Args:**

* \*indices (`Int`): The indices of the value to retrieve.

**Returns:**

The value at the specified indices.

### `__setitem__`

`__setitem__(self, *indices: Int, *, val: SIMD[type, 1])`

Stores the value at the specified indices.

**Args:**

* \*indices (`Int`): The indices of the value to store.
* val (`SIMD[type, 1]`): The value to store.

`__setitem__(self, indices: IndexList[rank], val: SIMD[type, 1])`

Stores the value at the specified indices.

**Args:**

* indices (`IndexList[rank]`): The indices of the value to store.
* val (`SIMD[type, 1]`): The value to store.

### `get_type_name`

`static get_type_name() -> String`

### `get_device_type_name`

`static get_device_type_name() -> String`

### `spec`

`spec(self) -> RuntimeTensorSpec[type, rank]`

Gets the `TensorSpec` of this tensor slice, which provides meta-data about the tensor slice.

**Returns:**

The static `TensorSpec` for this tensor slice.

### `shape`

`shape(self) -> IndexList[rank]`

Gets the shape of this tensor slice, as an `IndexList`.

**Returns:**

The shape of this tensor slice.

### `dim_size`

`dim_size(self, index: Int) -> Int`

Gets the size of a given dimension of this tensor slice using a run time value.

**Args:**

* index (`Int`): The zero-based index of the dimension.

**Returns:**

The size of the tensor slice in the given dimension.

`dim_size[index: Int](self) -> Int`

Gets the size of a given dimension of this tensor slice using a compile time value.

**Parameters:**

* index (`Int`): The zero-based index of the dimension.

**Returns:**

The size of the tensor slice in the given dimension.
### `strides`

`strides(self) -> IndexList[rank]`

Gets the strides of this tensor slice, as an `IndexList`.

**Returns:**

The strides of this tensor slice.

### `stride_length`

`stride_length(self, index: Int) -> Int`

Gets the length of the stride of a given dimension of this tensor slice using a run time value.

**Args:**

* index (`Int`): The zero-based index of the dimension.

**Returns:**

The stride length of the tensor slice in the given dimension.

`stride_length[index: Int](self) -> Int`

Gets the length of the stride of a given dimension of this tensor slice using a compile time value.

**Parameters:**

* index (`Int`): The zero-based index of the dimension.

**Returns:**

The stride length of the tensor slice in the given dimension.

### `size`

`size(self) -> Int`

Computes the tensor slice's number of elements.

**Returns:**

The total number of elements in the tensor slice.

### `unsafe_ptr`

`unsafe_ptr[__type: DType = type](self) -> UnsafePointer[SIMD[__type, 1]]`

Gets the pointer stored in this tensor slice. Because this method exposes the underlying pointer, accessing memory through it can violate the invariants of this tensor slice and lead to unexpected behavior. It should be used with caution.

**Parameters:**

* \_\_type (`DType`): The type of the `UnsafePointer` in this tensor slice.

**Returns:**

The `UnsafePointer` which contains the data for this tensor slice.

### `load`

`load[width: Int, _rank: Int](self, index: IndexList[_rank]) -> SIMD[type, width]`

Gets data from this tensor slice as a `SIMD` value.

**Parameters:**

* width (`Int`): The width of the `SIMD` value. This must be large enough to contain the data from this tensor slice.
* \_rank (`Int`): The rank of the tensor slice.

**Args:**

* index (`IndexList[_rank]`): An `IndexList` of size `_rank` indicating the position in the tensor slice to load data from.

**Returns:**

Data from this tensor slice at position `index`.

### `store`

`store[width: Int, _rank: Int, element_alignment: Int = 1](self: ManagedTensorSlice[io_spec, static_spec=static_spec], index: IndexList[_rank], val: SIMD[type, width])`

Sets data in this tensor slice from a `SIMD` value.

**Parameters:**

* width (`Int`): The width of the `SIMD` value.
* \_rank (`Int`): The rank of the tensor slice.
* element\_alignment (`Int`): Indicates the alignment of the pointer stored to memory. This is needed to issue vector stores on GPUs with strict alignment requirements.

**Args:**

* index (`IndexList[_rank]`): An `IndexList` of size `_rank` indicating the position in the tensor slice to store data at.
* val (`SIMD[type, width]`): The data to set into this tensor slice.

### `with_layout`

`with_layout[new_rank: Int, //, new_static_shape: DimList, new_static_strides: DimList](self, new_runtime_shape: IndexList[new_rank], new_runtime_strides: IndexList[new_rank], offset_ptr: OptionalReg[UnsafePointer[SIMD[type, 1]]] = OptionalReg[UnsafePointer[SIMD[type, 1]]]({:i1 0, 1})) -> ManagedTensorSlice[io_spec, static_spec=static_spec.with_layout[::Int](new_static_shape, new_static_strides)]`

### `to_layout_tensor`

`to_layout_tensor(self) -> LayoutTensor[type, static_spec.to_layout(), MutableAnyOrigin]`

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this buffer to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.

**Args:**

* writer (`W`): The object to write to.

### `__repr__`

`__repr__(self) -> String`

Gets the buffer as a string.

**Returns:**

A compact string representation of the buffer.
### `__str__`

`__str__(self) -> String`

Gets the buffer as a string.

**Returns:**

A compact string of the buffer.

---

## manager

Abstract base class for `KVCacheManager` and related KV cache input types.

## `KVCacheInputSymbols` {#max.nn.kv_cache.manager.KVCacheInputSymbols}

> *class* max.nn.kv\_cache.manager.KVCacheInputSymbols

Base class for input symbols for KV cache managers. The derived class is responsible for defining the input symbols for the specific KV cache manager.

For example, here’s a derived class for a text KV cache manager:

```python
@dataclass
class ContinuousBatchingKVCacheInputSymbols(KVCacheInputSymbols):
    kv_blocks: TensorType
    cache_lengths: TensorType
    lookup_table: TensorType
    max_lengths: TensorType
```

## `KVCacheInputs` {#max.nn.kv_cache.manager.KVCacheInputs}

> *class* max.nn.kv\_cache.manager.KVCacheInputs

A base class that holds KV cache related (Tensor) inputs. It is meant to be subclassed by concrete KV cache input types.

For example, here’s a derived class for a text KV cache manager:

```python
@dataclass
class RaggedKVCacheInputs(KVCacheInputs):
    blocks: Tensor
    cache_lengths: Tensor
    lookup_table: Tensor
    max_lengths: Tensor
```

## `KVCacheInputsSequence` {#max.nn.kv_cache.manager.KVCacheInputsSequence}

> *class* max.nn.kv\_cache.manager.KVCacheInputsSequence(kv\_cache\_inputs)

`KVCacheInputsSequence` is a sequence of [`KVCacheInputs`](#max.nn.kv_cache.manager.KVCacheInputs). It is primarily used in our multistep execution to represent batched `KVCacheInputs`.

**Parameters:**

**kv\_cache\_inputs** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`KVCacheInputs`](#max.nn.kv_cache.manager.KVCacheInputs) `]` )

### `kv_cache_inputs` {#max.nn.kv_cache.manager.KVCacheInputsSequence.kv_cache_inputs}

> kv\_cache\_inputs\*: [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[KVCacheInputs](#max.nn.kv_cache.manager.KVCacheInputs)]\*

## `KVCacheManager` {#max.nn.kv_cache.manager.KVCacheManager}

> *class* max.nn.kv\_cache.manager.KVCacheManager(params, max\_batch\_size, max\_seq\_len, num\_layers, devices, session, is\_ragged=False)

**Parameters:**

* **params** ([`KVCacheParams`](cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **max\_batch\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **num\_layers** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **devices** (`Sequence` `[` [`Device`](../../driver.md#max.driver.Device) `]` )
* **session** ([`InferenceSession`](../../engine.md#max.engine.InferenceSession) )
* **is\_ragged** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )

### `claim()` {#max.nn.kv_cache.manager.KVCacheManager.claim}

> claim(n)

Claims `n` blocks of memory in the cache for incoming requests. This returns a list of sequence ids, which identify a sequence’s location within the cache. A sequence id can then be passed to the `fetch()` function to return the `ContinuousBatchingKVCacheCollection` for those sequences.
**Parameters:**

**n** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:**

[list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]

### `contains()` {#max.nn.kv_cache.manager.KVCacheManager.contains}

> contains(seq\_id)

**Parameters:**

**seq\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:**

[bool](https://docs.python.org/3/library/functions.html#bool)

### `estimated_memory_size()` {#max.nn.kv_cache.manager.KVCacheManager.estimated_memory_size}

> *abstract classmethod* estimated\_memory\_size(params, max\_batch\_size, max\_seq\_len, num\_layers, available\_cache\_memory, devices, \*\*kwargs)

Returns the estimated total memory usage of the kv cache.

**Parameters:**

* **params** ([`KVCacheParams`](cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **max\_batch\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **num\_layers** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **available\_cache\_memory** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **devices** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`Device`](../../driver.md#max.driver.Device) `]` )
* **kwargs** ([`Any`](https://docs.python.org/3/library/typing.html#typing.Any) )

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `external_claim()` {#max.nn.kv_cache.manager.KVCacheManager.external_claim}

> external\_claim(seq\_ids)

Variant of `claim()` where sequence ids are reserved externally.

**Parameters:**

**seq\_ids** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` )

**Return type:**

None

### `fetch()` {#max.nn.kv_cache.manager.KVCacheManager.fetch}

> *abstract* fetch(batch, num\_steps=1)

Returns blocks and other inputs to the KV cache kernel for the given sequence ids and prompts.

**Parameters:**

* **batch** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `T` `]` )
* **num\_steps** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:**

[list](https://docs.python.org/3/library/stdtypes.html#list)\[[*KVCacheInputs*](#max.nn.kv_cache.manager.KVCacheInputs)]

### `increment_cache_lengths()` {#max.nn.kv_cache.manager.KVCacheManager.increment_cache_lengths}

> increment\_cache\_lengths(kv\_cache\_inputs, prev\_model\_inputs)

Prepares the inputs for a multistep execution, generally by incrementing the cache lengths. This should not require a device synchronization, as that would defeat the purpose of multistep execution. It should also not update the cache lengths in our manager, since this batch is still considered in-progress.
**Parameters:**

* **kv\_cache\_inputs** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`RaggedKVCacheInputs`](#max.nn.kv_cache.manager.RaggedKVCacheInputs) `]` `|` [`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`PaddedKVCacheInputs`](#max.nn.kv_cache.manager.PaddedKVCacheInputs) `]` )
* **prev\_model\_inputs** ([`Any`](https://docs.python.org/3/library/typing.html#typing.Any) )

**Return type:**

[list](https://docs.python.org/3/library/stdtypes.html#list)\[[*RaggedKVCacheInputs*](#max.nn.kv_cache.manager.RaggedKVCacheInputs)] | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*PaddedKVCacheInputs*](#max.nn.kv_cache.manager.PaddedKVCacheInputs)]

### `infer_optimal_batch_size()` {#max.nn.kv_cache.manager.KVCacheManager.infer_optimal_batch_size}

> *abstract classmethod* infer\_optimal\_batch\_size(params, max\_seq\_len, num\_layers, available\_cache\_memory, devices, \*\*kwargs)

Returns the estimated optimal batch size for the kv cache.

**Parameters:**

* **params** ([`KVCacheParams`](cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **num\_layers** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **available\_cache\_memory** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **devices** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`Device`](../../driver.md#max.driver.Device) `]` )
* **kwargs** ([`Any`](https://docs.python.org/3/library/typing.html#typing.Any) )

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `input_symbols()` {#max.nn.kv_cache.manager.KVCacheManager.input_symbols}

> *abstract* input\_symbols()

Returns the input symbols for the kv cache manager.

**Return type:**

[*Sequence*](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[*KVCacheInputSymbols*](#max.nn.kv_cache.manager.KVCacheInputSymbols)]

### `num_kv_inputs()` {#max.nn.kv_cache.manager.KVCacheManager.num_kv_inputs}

> num\_kv\_inputs()

Returns the default number of KV cache inputs for KV managers. Subclasses with a different number of KV cache inputs should override this method and [`increment_cache_lengths`](#max.nn.kv_cache.manager.KVCacheManager.increment_cache_lengths).

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `release()` {#max.nn.kv_cache.manager.KVCacheManager.release}

> release(seq\_id)

Releases the `seq_id` provided, marking this sequence as complete. This returns the `seq_id` back to the available pool of cache memory, allowing it to be reused when a new sequence is claimed.

**Parameters:**

**seq\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:**

None

### `slots_remaining` {#max.nn.kv_cache.manager.KVCacheManager.slots_remaining}

> *property* slots\_remaining\*: [set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)]\*

The cache slots still available to be claimed.

### `step()` {#max.nn.kv_cache.manager.KVCacheManager.step}

> step(batch)

Commits the new tokens into the prefix cache. This is a no-op if prefix caching is disabled.
**Parameters:**

**batch** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `T` `]` )

**Return type:**

None

## `PaddedKVCacheInputs` {#max.nn.kv_cache.manager.PaddedKVCacheInputs}

> *class* max.nn.kv\_cache.manager.PaddedKVCacheInputs(k\_cache, v\_cache, start\_pos, null\_op)

`PaddedKVCacheInputs` is a class that holds the inputs for the KV cache when used together with padded tensors.

**Parameters:**

* **k\_cache** ([`Tensor`](../../driver.md#max.driver.Tensor) )
* **v\_cache** ([`Tensor`](../../driver.md#max.driver.Tensor) )
* **start\_pos** ([`Tensor`](../../driver.md#max.driver.Tensor) )
* **null\_op** ([`Tensor`](../../driver.md#max.driver.Tensor) )

### `k_cache` {#max.nn.kv_cache.manager.PaddedKVCacheInputs.k_cache}

> k\_cache\*: [Tensor](../../driver.md#max.driver.Tensor)\*

### `null_op` {#max.nn.kv_cache.manager.PaddedKVCacheInputs.null_op}

> null\_op\*: [Tensor](../../driver.md#max.driver.Tensor)\*

### `start_pos` {#max.nn.kv_cache.manager.PaddedKVCacheInputs.start_pos}

> start\_pos\*: [Tensor](../../driver.md#max.driver.Tensor)\*

### `v_cache` {#max.nn.kv_cache.manager.PaddedKVCacheInputs.v_cache}

> v\_cache\*: [Tensor](../../driver.md#max.driver.Tensor)\*

## `RaggedKVCacheInputs` {#max.nn.kv_cache.manager.RaggedKVCacheInputs}

> *class* max.nn.kv\_cache.manager.RaggedKVCacheInputs(blocks, cache\_lengths, lookup\_table, max\_lengths)

`RaggedKVCacheInputs` is a class that holds the inputs for the KV cache when used together with ragged tensors.

**Parameters:**

* **blocks** ([`Tensor`](../../driver.md#max.driver.Tensor) )
* **cache\_lengths** ([`Tensor`](../../driver.md#max.driver.Tensor) )
* **lookup\_table** ([`Tensor`](../../driver.md#max.driver.Tensor) )
* **max\_lengths** ([`Tensor`](../../driver.md#max.driver.Tensor) )

### `blocks` {#max.nn.kv_cache.manager.RaggedKVCacheInputs.blocks}

> blocks\*: [Tensor](../../driver.md#max.driver.Tensor)\*

### `cache_lengths` {#max.nn.kv_cache.manager.RaggedKVCacheInputs.cache_lengths}

> cache\_lengths\*: [Tensor](../../driver.md#max.driver.Tensor)\*

### `lookup_table` {#max.nn.kv_cache.manager.RaggedKVCacheInputs.lookup_table}

> lookup\_table\*: [Tensor](../../driver.md#max.driver.Tensor)\*

### `max_lengths` {#max.nn.kv_cache.manager.RaggedKVCacheInputs.max_lengths}

> max\_lengths\*: [Tensor](../../driver.md#max.driver.Tensor)\*

---

## map

`map[origins: origin.set, //, func: fn(Int) capturing -> None](size: Int)`

Maps a function over a range from 0 to size.

**Parameters:**

* origins (`origin.set`): The capture origins.
* func (`fn(Int) capturing -> None`): Function to map.

**Args:**

* size (`Int`): The number of elements.

---

## map_reduce

`map_reduce[simd_width: Int, size: Dim, type: DType, acc_type: DType, origins_gen: origin.set, input_gen_fn: fn[DType, Int](Int) capturing -> SIMD[$0, $1], origins_vec: origin.set, reduce_vec_to_vec_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2], reduce_vec_to_scalar_fn: fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]](dst: NDBuffer[type, 1, origin, __init__[::Intable](size)], init: SIMD[acc_type, 1]) -> SIMD[acc_type, 1]`

Stores the result of calling `input_gen_fn` in `dst` and simultaneously reduces the result using a custom reduction function.

**Parameters:**

* simd\_width (`Int`): The vector width for the computation.
* size (`Dim`): The buffer size.
* type (`DType`): The buffer elements dtype.
* acc\_type (`DType`): The dtype of the reduction accumulator.
* origins\_gen (`origin.set`): The OriginSet of captured arguments by the input\_gen\_fn.
* input\_gen\_fn (`fn[DType, Int](Int) capturing -> SIMD[$0, $1]`): A function that generates inputs to reduce.
* origins\_vec (`origin.set`): The OriginSet of captured arguments by the reduce\_vec\_to\_vec\_fn.
* reduce\_vec\_to\_vec\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): A mapping function. This function is used to combine (accumulate) two chunks of input data: for example, we load two `8xfloat32` vectors of elements and need to reduce them into a single `8xfloat32` vector.
* reduce\_vec\_to\_scalar\_fn (`fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]`): A reduction function. This function is used to reduce a vector to a scalar: for example, when we have an `8xfloat32` vector and want to reduce it to a `float32` scalar.

**Args:**

* dst (`NDBuffer[type, 1, origin, __init__[::Intable](size)]`): The output buffer.
* init (`SIMD[acc_type, 1]`): The initial value to use in the accumulator.

**Returns:**

The computed reduction value.

---

## masked_load

`masked_load[dtype: DType, //, size: Int](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], mask: SIMD[bool, size], passthrough: SIMD[dtype, size], alignment: Int = 1) -> SIMD[dtype, size]`

Loads data from memory and returns it, replacing masked lanes with values from the passthrough vector.

**Parameters:**

* dtype (`DType`): DType of the return SIMD buffer.
* size (`Int`): Size of the return SIMD buffer.

**Args:**

* addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The base pointer for the load.
* mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the memory stored at addr.
* passthrough (`SIMD[dtype, size]`): In the result vector, the masked-off lanes are replaced with the passthrough vector.
* alignment (`Int`): The alignment of the source addresses. Must be 0 or a power of two constant integer value. The default is 1.

**Returns:**

The loaded memory stored in a vector of type `SIMD[dtype, size]`.

---

## masked_store

`masked_store[size: Int](value: SIMD[dtype, size], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], mask: SIMD[bool, size], alignment: Int = 1)`

Stores a value at a memory location, skipping masked lanes.

**Parameters:**

* size (`Int`): Size of `value`, the data to store.

**Args:**

* value (`SIMD[dtype, size]`): The vector containing data to store.
* addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The memory location to store data at.
* mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of `value`.
* alignment (`Int`): The alignment of the destination locations. Must be 0 or a power of two constant integer value.

---

## MaskName

`struct MaskName`

A tile's masking status.
## Fields

* name (`String`):

## Implemented traits

`AnyType`, `Stringable`, `UnknownDestructibility`

## Aliases

### `CAUSAL`

`alias CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("causal"))`

### `CHUNKED`

`alias CHUNKED = MaskName(__init__[__mlir_type.!kgen.string]("chunked"))`

### `CHUNKED_CAUSAL`

`alias CHUNKED_CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("chunked_causal"))`

### `MATERIALIZED`

`alias MATERIALIZED = MaskName(__init__[__mlir_type.!kgen.string]("materialized"))`

### `NULL`

`alias NULL = MaskName(__init__[__mlir_type.!kgen.string]("null"))`

### `SLIDING_WINDOW_CAUSAL`

`alias SLIDING_WINDOW_CAUSAL = MaskName(__init__[__mlir_type.!kgen.string]("sliding_window_causal"))`

## Methods

### `__init__`

`__init__(out self, name: String)`

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

`__eq__(self, rhs: String) -> Bool`

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

### `__str__`

`__str__(self) -> String`

---

## MaterializedMask

`@register_passable(trivial)`

`struct MaterializedMask[type_: DType, rank_: Int, shape_: DimList]`

Mask that's backed by a materialized tensor.

## Fields

* mask\_tensor (`NDBuffer[type_, rank_, MutableAnyOrigin, shape_]`):
* start\_pos (`OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]`):
* is\_multiple\_of\_2 (`Bool`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAMask`, `Movable`, `UnknownDestructibility`

## Aliases

### `apply_log2e_after_mask`

`alias apply_log2e_after_mask = True`

### `mask_out_of_bound`

`alias mask_out_of_bound = True`

### `mask_safe_out_of_bounds`

`alias mask_safe_out_of_bounds = False`

### `MaskType`

`alias MaskType = NDBuffer[type_, rank_, MutableAnyOrigin, shape_]`

## Methods

### `__init__`

`__init__(mask_tensor: NDBuffer[type_, rank_, MutableAnyOrigin, shape_], start_pos: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]] = OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]]({:i1 0, 1})) -> Self`

### `get_start_pos`

`get_start_pos(self, batch_idx: Int) -> Int`

### `mask`

`mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]`

### `status`

`status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus`

---

## matfp

`matfp(gpr: Int)`

Float16 matrix multiply.

---

## math

Implements math methods that work on layout tensors.

## Functions

* [`max`](./max): Computes maximum reduction along specified axis.
* [`outer_product_acc`](./outer_product_acc): Updates result tensor with the outer product of two vectors.
* [`sum`](./sum): Computes sum reduction along specified axis.

---

## math

Defines basic math functions for use in the open source parts of the standard library, since the `math` package is currently closed source and cannot be depended on there. These are Mojo built-ins, so you don't need to import them.

## Traits

* [`Absable`](/mojo/stdlib/builtin/math/Absable): The `Absable` trait describes a type that defines an absolute value operation. (A conformance sketch follows this list.)
* [`Powable`](/mojo/stdlib/builtin/math/Powable): The `Powable` trait describes a type that defines a power operation (i.e. exponentiation) with the same base and exponent types.
* [`Roundable`](/mojo/stdlib/builtin/math/Roundable): The `Roundable` trait describes a type that defines a rounding operation.
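As a quick illustration of these traits, here is a minimal sketch (not from the original reference) of a toy type conforming to `Absable` so that the builtin `abs()` works on it. The `Meters` type is hypothetical:

```mojo
struct Meters(Absable, Copyable, Movable):
    var value: Float64

    fn __init__(out self, value: Float64):
        self.value = value

    # Absable only requires __abs__, returning Self.
    fn __abs__(self) -> Self:
        return Self(self.value if self.value >= 0 else -self.value)

def main():
    print(abs(Meters(-3.5)).value)  # 3.5
```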
## Functions

* [`abs`](/mojo/stdlib/builtin/math/abs): Get the absolute value of the given object.
* [`divmod`](/mojo/stdlib/builtin/math/divmod): Performs integer division and returns the quotient and the remainder.
* [`max`](/mojo/stdlib/builtin/math/max): Gets the maximum of two integers.
* [`min`](/mojo/stdlib/builtin/math/min): Gets the minimum of two integers.
* [`pow`](/mojo/stdlib/builtin/math/pow): Computes the `base` raised to the power of the `exp`.
* [`round`](/mojo/stdlib/builtin/math/round): Get the rounded value of the given object.

---

## math

Implements the math package.

## Modules

* [`constants`](/mojo/stdlib/math/constants/): Defines math utilities.
* [`math`](/mojo/stdlib/math/math/): Defines math utilities.
* [`polynomial`](/mojo/stdlib/math/polynomial/): Provides two implementations for evaluating polynomials.

---

## math

Defines math utilities. You can import these APIs from the `math` package. For example:

```mojo
from math import floor
```

A short usage sketch also follows the function list below.

## Traits

* [`Ceilable`](/mojo/stdlib/math/math/Ceilable): The `Ceilable` trait describes a type that defines a ceiling operation.
* [`CeilDivable`](/mojo/stdlib/math/math/CeilDivable): The `CeilDivable` trait describes a type that defines a ceil division operation.
* [`CeilDivableRaising`](/mojo/stdlib/math/math/CeilDivableRaising): The `CeilDivableRaising` trait describes a type that defines floor division and negation operations that can raise.
* [`Floorable`](/mojo/stdlib/math/math/Floorable): The `Floorable` trait describes a type that defines a floor operation.
* [`Truncable`](/mojo/stdlib/math/math/Truncable): The `Truncable` trait describes a type that defines a truncation operation.

## Functions

* [`acos`](/mojo/stdlib/math/math/acos): Computes the `acos` of the inputs.
* [`acosh`](/mojo/stdlib/math/math/acosh): Computes the `acosh` of the inputs.
* [`align_down`](/mojo/stdlib/math/math/align_down): Returns the closest multiple of alignment that is less than or equal to value.
* [`align_up`](/mojo/stdlib/math/math/align_up): Returns the closest multiple of alignment that is greater than or equal to value.
* [`asin`](/mojo/stdlib/math/math/asin): Computes the `asin` of the inputs.
* [`asinh`](/mojo/stdlib/math/math/asinh): Computes the `asinh` of the inputs.
* [`atan`](/mojo/stdlib/math/math/atan): Computes the `atan` of the inputs.
* [`atan2`](/mojo/stdlib/math/math/atan2): Computes the `atan2` of the inputs.
* [`atanh`](/mojo/stdlib/math/math/atanh): Computes the `atanh` of the inputs.
* [`cbrt`](/mojo/stdlib/math/math/cbrt): Computes the `cbrt` of the inputs.
* [`ceil`](/mojo/stdlib/math/math/ceil): Get the ceiling value of the given object.
* [`ceildiv`](/mojo/stdlib/math/math/ceildiv): Return the rounded-up result of dividing numerator by denominator.
* [`clamp`](/mojo/stdlib/math/math/clamp): Clamps the integer value vector to be in a certain range.
* [`copysign`](/mojo/stdlib/math/math/copysign): Returns a value with the magnitude of the first operand and the sign of the second operand.
* [`cos`](/mojo/stdlib/math/math/cos): Computes the `cos` of the inputs.
* [`cosh`](/mojo/stdlib/math/math/cosh): Computes the `cosh` of the inputs.
* [`erf`](/mojo/stdlib/math/math/erf): Performs the elementwise Erf on a SIMD vector.
* [`erfc`](/mojo/stdlib/math/math/erfc): Computes the `erfc` of the inputs.
* [`exp`](/mojo/stdlib/math/math/exp): Calculates elementwise exponential of the input vector.
* [​`exp2`](/mojo/stdlib/math/math/exp2): Computes elementwise 2 raised to the power of n, where n is an element of the input SIMD vector. * [​`expm1`](/mojo/stdlib/math/math/expm1): Computes the `expm1` of the inputs. * [​`factorial`](/mojo/stdlib/math/math/factorial): Computes the factorial of the integer. * [​`floor`](/mojo/stdlib/math/math/floor): Get the floor value of the given object. * [​`fma`](/mojo/stdlib/math/math/fma): Performs `fma` (fused multiply-add) on the inputs. * [​`frexp`](/mojo/stdlib/math/math/frexp): Breaks floating point values into a fractional part and an exponent part. This follows C and Python in increasing the exponent by 1 and normalizing the fraction from 0.5 to 1.0 instead of 1.0 to 2.0. * [​`gamma`](/mojo/stdlib/math/math/gamma): Computes the Gamma of the input. * [​`gcd`](/mojo/stdlib/math/math/gcd): Compute the greatest common divisor of two integers. * [​`hypot`](/mojo/stdlib/math/math/hypot): Computes the `hypot` of the inputs. * [​`iota`](/mojo/stdlib/math/math/iota): Creates a SIMD vector containing an increasing sequence, starting from offset. * [​`isclose`](/mojo/stdlib/math/math/isclose): Checks if the two input values are numerically within a tolerance. * [​`isqrt`](/mojo/stdlib/math/math/isqrt): Performs elementwise reciprocal square root on a SIMD vector. * [​`j0`](/mojo/stdlib/math/math/j0): Computes the Bessel function of the first kind of order 0 for each input value. * [​`j1`](/mojo/stdlib/math/math/j1): Computes the Bessel function of the first kind of order 1 for each input value. * [​`lcm`](/mojo/stdlib/math/math/lcm): Computes the least common multiple of two integers. * [​`ldexp`](/mojo/stdlib/math/math/ldexp): Computes elementwise ldexp function. * [​`lgamma`](/mojo/stdlib/math/math/lgamma): Computes the `lgamma` of the inputs. * [​`log`](/mojo/stdlib/math/math/log): Performs elementwise natural log (base E) of a SIMD vector. * [​`log10`](/mojo/stdlib/math/math/log10): Computes the `log10` of the inputs. * [​`log1p`](/mojo/stdlib/math/math/log1p): Computes the `log1p` of the inputs. * [​`log2`](/mojo/stdlib/math/math/log2): Performs elementwise log (base 2) of a SIMD vector. * [​`logb`](/mojo/stdlib/math/math/logb): Computes the `logb` of the inputs. * [​`modf`](/mojo/stdlib/math/math/modf): Computes the integral and fractional part of the value. * [​`recip`](/mojo/stdlib/math/math/recip): Performs elementwise reciprocal on a SIMD vector. * [​`remainder`](/mojo/stdlib/math/math/remainder): Computes the `remainder` of the inputs. * [​`scalb`](/mojo/stdlib/math/math/scalb): Computes the `scalb` of the inputs. * [​`sin`](/mojo/stdlib/math/math/sin): Computes the `sin` of the inputs. * [​`sinh`](/mojo/stdlib/math/math/sinh): Computes the `sinh` of the inputs. * [​`sqrt`](/mojo/stdlib/math/math/sqrt): Performs square root on an integer. * [​`tan`](/mojo/stdlib/math/math/tan): Computes the `tan` of the inputs. * [​`tanh`](/mojo/stdlib/math/math/tanh): Performs elementwise evaluation of the tanh function. * [​`trunc`](/mojo/stdlib/math/math/trunc): Get the truncated value of the given object. * [​`ulp`](/mojo/stdlib/math/math/ulp): Computes the ULP (units of last place) or (units of least precision) of the number. * [​`y0`](/mojo/stdlib/math/math/y0): Computes the Bessel function of the second kind of order 0 for each input value. * [​`y1`](/mojo/stdlib/math/math/y1): Computes the Bessel function of the second kind of order 1 for each input value. 
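To tie a few of the functions above together, here is a small usage sketch (not part of the original reference):

```mojo
from math import ceildiv, factorial, gcd, isclose, sqrt

def main():
    print(ceildiv(7, 2))    # 4: rounded-up integer division
    print(gcd(12, 18))      # 6
    print(factorial(5))     # 120
    print(sqrt(16))         # 4: square root on an integer
    # isclose compares floating point values within a tolerance.
    print(isclose(Float32(0.1) + 0.2, Float32(0.3), atol=1e-6))  # True
```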
---

## matmul

## Structs

* [`TiledMatmul`](./TiledMatmul): Tiled matmul implementation integrating packing, inner loop and tile partitions.

## Traits

* [`InnerMatmulKernel`](./InnerMatmulKernel):

## Functions

* [`elementwise_epilogue_c_tile`](./elementwise_epilogue_c_tile):
* [`matmul`](./matmul):
* [`tiled_matmul_run`](./tiled_matmul_run): Interface function to run tiled matmul on a given sub-tile.

---

## matmul

`matmul[transpose_a: Bool = False, transpose_b: Bool = False, b_packed: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], ctx: DeviceContextPtr = DeviceContextPtr())`

`matmul[transpose_a: Bool = False, transpose_b: Bool = False, b_packed: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), saturated_vnni: Bool = False, single_thread_blocking_override: Bool = False, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], ctx: Optional[DeviceContext])`

---

## matmul

`matmul[c_type: DType, a_type: DType, b_type: DType, //, use_tensor_core: Bool = False, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[b_type, 2, origin, shape], ctx: DeviceContext)`

This implements the matmul kernel for the Blackwell architecture. Note that we do not currently have pure Mojo kernels that utilize Blackwell architectures, so in their place we just call the cuBLAS library.
---

## matmul

`matmul[use_tf32: Bool = False](ctx: DeviceContext, c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], *, c_row_major: Bool = False, transpose_a: Bool = False, transpose_b: Bool = False, alpha: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](1), beta: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](0))`

Matmul using the vendor BLAS library, with a global handle.

`matmul[use_tf32: Bool = False](ctx: DeviceContext, handle: Handle[backend], c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], *, c_row_major: Bool = False, transpose_a: Bool = False, transpose_b: Bool = False, alpha: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](1), beta: SIMD[float32, 1] = __init__[__mlir_type.!pop.float_literal](0))`

---

## matmul_allreduce

`matmul_allreduce[ngpus: Int, partition_dim: Int, num_partitions: Int, outputs_lambda: fn[Int, DType, Int, Int, Int](IndexList[$2], SIMD[$1, $3]) capturing -> None, type: DType, a_static_shape: DimList, b_static_shape: DimList, c_static_shape: DimList, out_static_shape: DimList](a_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, a_static_shape], ngpus], b_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, b_static_shape], ngpus], c_temp_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, c_static_shape], ngpus], output_buffers: InlineArray[NDBuffer[type, 2, MutableAnyOrigin, out_static_shape], ngpus], rank_sigs: InlineArray[UnsafePointer[Signal], 8], ctxs: List[DeviceContext])`

Performs C = matmul(A, B^T) followed by Out = allreduce(C) across multiple GPUs. Splits the A or B and C matrices into `num_partitions` submatrices at dimension `partition_dim`. This way we can perform `num_partitions` independent matmul + allreduce kernels and overlap some of the computation.

---

## matmul_default

## Structs

* [`Inner_matmul_default`](./Inner_matmul_default):

---

## matmul_dynamic_scaled_fp8

`matmul_dynamic_scaled_fp8[c_type: DType, a_type: DType, b_type: DType, a_scales_type: DType, b_scales_type: DType, //, transpose_b: Bool = False, config: OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]] = OptionalReg[MatmulConfig[a_type, b_type, c_type, transpose_b]]({:i1 0, 1}), target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[b_type, 2, origin, shape], a_scales: NDBuffer[a_scales_type, 2, origin, shape], b_scales: NDBuffer[b_scales_type, 2, origin, shape], ctx: DeviceContext)`

---

## matmul_gpu

## Structs

* [`AMDSchedulerTuning`](./AMDSchedulerTuning):

## Functions

* [`__nvvm_ldg_f4`](./__nvvm_ldg_f4):
* [`matmul_kernel`](./matmul_kernel): Matrix multiplication using shared memory. This version loads blocks of size tile\_size x tile\_size from A and B and updates a tile\_size x tile\_size tile in C. The thread block should have shape (tile\_size, tile\_size, 1). Each thread is mapped to one element in C. The grid should have shape (N/tile\_size, M/tile\_size, 1). N is the first dimension for coalesced access.
* [`matmul_kernel_naive`](./matmul_kernel_naive):
* [`multistage_gemm`](./multistage_gemm):
* [`split_k_reduce`](./split_k_reduce):

---

## matmul_gpu_qint4

`matmul_gpu_qint4[c_type: DType, a_type: DType, //, group_size: Int, target: StringSlice[StaticConstantOrigin], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[uint8, 2, origin, shape], ctx: DeviceContextPtr = DeviceContextPtr())`

---

## matmul_gpu_qint4_impl

`matmul_gpu_qint4_impl[c_type: DType, a_type: DType, //, group_size: Int, target: StringSlice[StaticConstantOrigin], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, shape], a: NDBuffer[a_type, 2, origin, shape], b: NDBuffer[uint8, 2, origin, shape], ctx: Optional[DeviceContext])`

---

## matmul_i8mm

## Structs

* [`Inner_matmul_i8mm`](./Inner_matmul_i8mm):
* [`LoadStore_i8mm`](./LoadStore_i8mm):

---

## matmul_kernel

`matmul_kernel[c_type: DType, a_type: DType, b_type: DType, tile_size: Int, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c_ptr: UnsafePointer[SIMD[c_type, 1]], a_ptr: UnsafePointer[SIMD[a_type, 1]], b_ptr: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)`

Matrix multiplication using shared memory. This version loads blocks of size tile\_size x tile\_size from A and B and updates a tile\_size x tile\_size tile in C. The thread block should have shape (tile\_size, tile\_size, 1). Each thread is mapped to one element in C. The grid should have shape (N/tile\_size, M/tile\_size, 1). N is the first dimension for coalesced access.
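The grid and block shapes described above are simple arithmetic over the problem size. The following sketch (not from the original reference) computes them for a hypothetical problem, using `ceildiv` so that sizes that are not multiples of the tile size are still covered:

```mojo
from math import ceildiv

def main():
    alias tile_size = 16
    var M = 1024  # rows of C
    var N = 768   # columns of C; the first grid dimension for coalesced access
    # Block shape is (tile_size, tile_size, 1); grid is (N/tile_size, M/tile_size, 1).
    print("grid  =", ceildiv(N, tile_size), "x", ceildiv(M, tile_size), "x 1")
    print("block =", tile_size, "x", tile_size, "x 1")
```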
---

## matmul_kernel_naive

`matmul_kernel_naive[c_type: DType, a_type: DType, b_type: DType, BLOCK_DIM: Int, transpose_b: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), s_type: DType = get_accum_type[::DType,::DType]()](c_ptr: UnsafePointer[SIMD[c_type, 1]], a_ptr: UnsafePointer[SIMD[a_type, 1]], b_ptr: UnsafePointer[SIMD[b_type, 1]], m: Int, n: Int, k: Int)`

---

## matmul_neon

## Structs

* [`Inner_matmul_neon`](./Inner_matmul_neon):

---

## matmul_Q4_K

`matmul_Q4_K[elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin], c: NDBuffer[float32, 2, origin])`

---

## matmul_Q4_K_pack_b

`matmul_Q4_K_pack_b[b_origin: MutableOrigin, b_packed_origin: MutableOrigin](b: NDBuffer[uint8, 2, b_origin], b_packed: NDBuffer[uint8, 2, b_packed_origin])`

---

## matmul_Q6_K

`matmul_Q6_K[elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin], c: NDBuffer[float32, 2, origin])`

---

## matmul_Q6_K_pack_b

`matmul_Q6_K_pack_b[b_origin: MutableOrigin, b_packed_origin: MutableOrigin](b: NDBuffer[uint8, 2, b_origin], b_packed: NDBuffer[uint8, 2, b_packed_origin])`

---

## matmul_qint4

`matmul_qint4[group_size: Int, b_static_shape: DimList = create_unknown[::Int](), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](a: NDBuffer[float32, 2, origin], b: NDBuffer[uint8, 2, origin, b_static_shape], c: NDBuffer[float32, 2, origin])`

---

## matmul_qint4_pack_b

`matmul_qint4_pack_b[group_size: Int](b: NDBuffer[uint8, 2, origin], b_rot: NDBuffer[uint8, 2, origin])`

---

## matmul_sm90

## Aliases

### `NumWarpPerWarpGroup`

`alias NumWarpPerWarpGroup = 4`

### `WARP_GROUP_SIZE`

`alias WARP_GROUP_SIZE = 128`

## Functions

* [`cluster_size`](./cluster_size):
* [`consumer_main_loop`](./consumer_main_loop):
* [`hopper_matmul_tma_wgmma`](./hopper_matmul_tma_wgmma):
* [`hopper_matmul_tma_wgmma_kernel`](./hopper_matmul_tma_wgmma_kernel):
* [`producer_main_loop`](./producer_main_loop):
* [`promote_to_cuda_cores`](./promote_to_cuda_cores):
* [`tma_wgmma_warp_specialized_gemm_kernel`](./tma_wgmma_warp_specialized_gemm_kernel):
* [`tma_wgmma_warp_specialized_gemm_kernel_persistent`](./tma_wgmma_warp_specialized_gemm_kernel_persistent):
* [`warp_specialize_gemm_with_multicasting`](./warp_specialize_gemm_with_multicasting):
* [`warp_specialized_gemm_output`](./warp_specialized_gemm_output):

---

## matmul_tile_scheduler

## Structs

* [`MatmulSchedule`](./MatmulSchedule):
* [`TileScheduler`](./TileScheduler):
* [`WorkInfo`](./WorkInfo):

---

## matmul_vendor

## Functions

* [`matmul`](./matmul): This implements the matmul kernel for the Blackwell architecture. Note that we do not currently have pure Mojo kernels that utilize Blackwell architectures, so in their place we just call the cuBLAS library.
---

## matmul_vnni

## Structs

* [`Inner_matmul_vnni`](./Inner_matmul_vnni):

---

## MatmulConfig

`@register_passable(trivial)`

`struct MatmulConfig[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False, mma_shape: IndexList[3] = get_mma_shape[::DType,::DType,::Int]()]`

Static configuration of GPU matmul.

## Fields

* block\_tile\_shape (`IndexList[3]`):
* warp\_tile\_shape (`IndexList[3]`):
* num\_pipeline\_stages (`UInt`):
* num\_k\_partitions (`UInt`):
* k\_group\_size (`UInt`):
* num\_warp\_k\_partitions (`UInt`):
* cluster\_shape (`IndexList[3]`):
* num\_consumer (`UInt`):
* partitioned\_multicast (`Bool`):
* scheduler\_hint (`IndexList[3]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `ACCUM_PRECISION`

`alias ACCUM_PRECISION = 1`

### `accum_type`

`alias accum_type = get_accum_type[::DType,::DType]()`

### `OUTPUT_PRECISION`

`alias OUTPUT_PRECISION = 2`

### `split_k_reduction_scheme`

`alias split_k_reduction_scheme = env_get_int[::StringSlice[::Bool()`

### `split_k_reduction_type`

`alias split_k_reduction_type = c_type if (env_get_int[::StringSlice[::Bool() == 2) else get_accum_type[::DType,::DType]()`

## Methods

### `__init__`

`__init__(block_tile_shape: IndexList[3] = Index(128, 128, 32), warp_tile_shape: IndexList[3] = Index(64, 64, 32), cluster_shape: IndexList[3] = Index(1, 1, 1), num_pipeline_stages: UInt = UInt(4), num_k_partitions: UInt = UInt(1), k_group_size: UInt = UInt(1), num_warp_k_partitions: UInt = UInt(1), num_consumer: UInt = UInt(1), partitioned_multicast: Bool = False, scheduler_hint: IndexList[3] = Index(2, 2, 2), pdl_level: PDLLevel = PDLLevel()) -> Self`

### `__eq__`

`__eq__(self, rhs: MatmulConfig[a_type, b_type, c_type, transpose_b, mma_shape]) -> Bool`

### `num_warps_m`

`num_warps_m(self) -> UInt`

### `num_warps_n`

`num_warps_n(self) -> UInt`

### `num_threads`

`num_threads(self) -> UInt`

### `shared_mem_usage`

`shared_mem_usage(self) -> Int`

### `grid_dim`

`grid_dim(self, m: UInt, n: UInt) -> IndexList[3]`

### `block_dim`

`block_dim(self) -> IndexList[3]`

### `work_space_size`

`work_space_size(self, M: UInt, N: UInt) -> UInt`

### `pdl_level`

`pdl_level(self) -> PDLLevel`

### `__str__`

`__str__(self) -> String`

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

### `__repr__`

`__repr__(self) -> String`

### `__hash__`

`__hash__[H: _Hasher](self, mut hasher: H)`

Updates hasher with the underlying bytes.

**Parameters:**

* H (`_Hasher`): The hasher type.

**Args:**

* hasher (`H`): The hasher instance.

---

## MatmulKernels

`@register_passable(trivial)`

`struct MatmulKernels[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False]`

Supported matmul kernels. The configurations are named as `<architecture>_<block tile shape>_<number of pipeline stages>`, for example `ampere_128x128_4`. BK, the MMA shape, and the warp tile shape are decided internally.
## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ampere_128x128_4` `alias ampere_128x128_4 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `ampere_256x128_3` `alias ampere_256x128_3 = MatmulConfig(Index(128, 256, (_bk_base[::DType,::Bool]() * 2)), Index(64, 64, (_bk_base[::DType,::Bool]() * 2)), Index(1, 1, 1), UInt(3), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `ampere_256x64_4` `alias ampere_256x64_4 = MatmulConfig(Index(64, 256, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `hopper_128x128_4` `alias hopper_128x128_4 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(4), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x128_1` `alias mi300x_128x128_1 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x128_2` `alias mi300x_128x128_2 = MatmulConfig(Index(128, 128, _bk_base[::DType,::Bool]()), Index(64, 64, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(2), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_128x256_1` `alias mi300x_128x256_1 = MatmulConfig(Index(128, 256, _bk_base[::DType,::Bool]()), Index(64, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 4, 2), PDLLevel())` ### `mi300x_192x256_1` `alias mi300x_192x256_1 = MatmulConfig(Index(192, 256, _bk_base[::DType,::Bool]()), Index(96, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 6, 2), PDLLevel())` ### `mi300x_224x256_1` `alias mi300x_224x256_1 = MatmulConfig(Index(224, 256, _bk_base[::DType,::Bool]()), Index(112, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 7, 2), PDLLevel())` ### `mi300x_256x256_1` `alias mi300x_256x256_1 = MatmulConfig(Index(256, 256, _bk_base[::DType,::Bool]()), Index(128, 128, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(4, 8, 2), PDLLevel())` ### `mi300x_64x64_1` `alias mi300x_64x64_1 = MatmulConfig(Index(64, 64, _bk_base[::DType,::Bool]()), Index(32, 32, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(1), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `mi300x_64x64_splitk_1` `alias mi300x_64x64_splitk_1 = MatmulConfig(Index(64, 64, _bk_base[::DType,::Bool]()), Index(32, 32, _bk_base[::DType,::Bool]()), Index(1, 1, 1), UInt(1), UInt(4), UInt(1), UInt(1), UInt(1), False, Index(2, 2, 2), PDLLevel())` ### `tuning_config` `alias tuning_config = MatmulConfig(Index(env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool()), Index(env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool(), env_get_int[::StringSlice[::Bool()), Index(1, 1, 1), UInt(env_get_int[::StringSlice[::Bool()), UInt(env_get_int[::StringSlice[::Bool()), UInt(1), UInt(env_get_int[::StringSlice[::Bool()), UInt(1), False, 
Index(2, 2, 2), PDLLevel())` --- ## MatmulSchedule `@register_passable(trivial)` `struct MatmulSchedule` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `NONE` `alias NONE = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](-1))` ### `TILE1D` `alias TILE1D = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](0))` ### `TILE2D` `alias TILE2D = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## matrix_band_part The module implements matrix band part functions. ## Functions * [​`matrix_band_part`](./matrix_band_part): --- ## matrix_band_part `matrix_band_part[: origin.set, //, type: DType, int_type: DType, cond_type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], simd_width: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[rank], num_lower: NDBuffer[int_type, 1, origin], num_upper: NDBuffer[int_type, 1, origin], exclude_buf: NDBuffer[cond_type, 1, origin], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)` --- ## max The MAX Mojo API reference. The MAX API provides a state-of-the-art graph compiler and runtime library that executes AI models with incredible speed on a wide range of hardware. ## Packages * [​`tensor`](/max/api/mojo/tensor/): APIs to create and manage tensors in a graph. --- ## max The MAX Python API reference. The MAX API provides a state-of-the-art graph compiler and runtime library that executes AI models with incredible speed on a wide range of hardware. ## Modules * [`driver`](/max/api/python/driver): APIs to interact with devices. * [`dtype`](/max/api/python/dtype): APIs to define data types. * [`engine`](/max/api/python/engine): APIs to load and execute models. * [`entrypoints`](/max/api/python/entrypoints): APIs to run MAX models. * [`torch`](/max/api/python/torch): APIs to use custom ops with PyTorch. ## Packages * [`graph`](/max/api/python/graph): APIs to build models (inference graphs). * [`pipelines`](/max/api/python/pipelines): APIs to build model pipelines. * [`nn`](/max/api/python/nn): APIs to build MAX NN models. --- ## max `max[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], out: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Computes maximum reduction along specified axis. Reduces the input tensor by taking maximum elements along the specified axis and stores the result in the output tensor. **Constraints:** All tensors must have statically known shapes. `out.rank` must equal `inp.rank - 1`. Non-reduction dimensions must match between `inp` and `out`. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to take maximum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to reduce. 
* ​out (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor to store maximum results. `max[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, _reduce_res_row_major_shape(axis, layout), MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]` Computes maximum reduction along specified axis, returning a new tensor. Reduces the input tensor by taking maximum elements along the specified axis and returns a new tensor with the results. **Constraints:** All tensors must have statically known shapes. Result will have rank equal to `inp.rank` - 1. Non-reduction dimensions in the result match the input. Currently only supports rank-2 inputs. **Parameters:** * ​axis (`Int`): The axis to take maximum along. **Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to reduce. **Returns:** A new tensor containing the maximum values along the specified axis. `max[dtype: DType, layout: Layout](x: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], y: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]` Computes element-wise maximum of two tensors. Returns a new tensor containing the element-wise maximum between the input tensors. **Constraints:** Input tensors must have statically known shapes and matching layouts. **Parameters:** * ​dtype (`DType`): The data type of the input tensors. * ​layout (`Layout`): The layout of the input tensors. **Args:** * ​x (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): First input tensor. * ​y (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Second input tensor. **Returns:** A new tensor containing the element-wise maximum. --- ## max `max(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the max element in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The maximum of the buffer elements. `max[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the max across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. 
**Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `max[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the max across the input and output shape. This performs the max computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the max on. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## max `max(x: Int, y: Int, /) -> Int` Gets the maximum of two integers. **Args:** * ​x (`Int`): Integer input to max. * ​y (`Int`): Integer input to max. **Returns:** Maximum of x and y. `max(x: UInt, y: UInt, /) -> UInt` Gets the maximum of two integers. **Args:** * ​x (`UInt`): Integer input to max. * ​y (`UInt`): Integer input to max. **Returns:** Maximum of x and y. `max[dtype: DType, //](x: SIMD[dtype, size], y: SIMD[dtype, size], /) -> SIMD[dtype, size]` Performs elementwise maximum of x and y. An element of the result SIMD vector will be the maximum of the corresponding elements in x and y. **Constraints:** The type of the inputs must be numeric or boolean. **Parameters:** * ​dtype (`DType`): The data type of the SIMD vector. **Args:** * ​x (`SIMD[dtype, size]`): First SIMD vector. * ​y (`SIMD[dtype, size]`): Second SIMD vector. **Returns:** A SIMD vector containing the elementwise maximum of x and y. `max[T: Copyable & GreaterThanComparable](x: T, *ys: T) -> T` Gets the maximum value from a sequence of values. **Parameters:** * ​T (`Copyable & GreaterThanComparable`): A type that is both copyable and comparable with greater than. **Args:** * ​x (`T`): The first value to compare. * ​\*ys (`T`): Zero or more additional values to compare. **Returns:** The maximum value from the input sequence. --- ## max `max[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the maximum value across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global maximum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final reduced value is broadcast to all threads in the block. If False, only the first thread will have the complete result. 
**Args:**

* val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to find the maximum.

**Returns:**

If broadcast is True, each thread in the block will receive the maximum value across the entire block. Otherwise, only the first thread will have the complete result.

---

## max

`max[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]`

Computes the maximum value across all lanes in a warp.

This is a convenience wrapper around lane\_group\_max that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global maximum value across all lanes in the warp.

**Parameters:**

* val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32).
* simd\_width (`Int`): The number of elements in the SIMD vector.

**Args:**

* val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the maximum.

**Returns:**

A SIMD value where all lanes contain the maximum value found across the entire warp.

---

## max CLI

The `max` CLI tool accelerates GenAI tasks by creating optimized inference pipelines with [OpenAI-compatible endpoints](https://platform.openai.com/docs/api-reference/introduction). It supports models from [Hugging Face](https://builds.modular.com/?category=models) and [MAX Graph](/max/model-formats.mdx#max-graph) optimized versions of models like Llama 3.1, Mistral, and Replit Code. You can generate text or start an OpenAI-compatible endpoint with a single command.

:::note
The `max-pipelines` CLI tool has been renamed to `max`. The underlying implementation remains identical, with the same commands and flags, so your existing workflows will continue to work as expected.
:::

## Install

Create a Python project and install the `modular` package (for example, with `pip install modular`) to get our APIs and the `max` CLI. When you install the `modular` package, you get the `max` CLI tool automatically. You can check your version like this:

```sh
max --version
```

## Run your first model

Now that you have `max` installed, you can run your first model:

```sh
max generate --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --prompt "Generate a story about a robot"
```

:::note
If you use private or gated models, you must set your [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) first. For example:

```sh
export HF_TOKEN="hf_..."
```

Then you can run `max` commands for a private or gated model.
:::

## Uninstall

To remove the `modular` Python package, use whichever tool manages your environment. With pip:

```sh
pip uninstall modular
```

With uv:

```sh
uv pip uninstall modular
```

With magic:

```sh
magic remove modular
```

## Commands

`max` provides the following commands. You can also print the available commands and documentation with `--help`. For example:

```sh
max --help
```

```sh
max serve --help
```

### `encode`

Converts input text into embeddings for semantic search, text similarity, and NLP applications.

```sh
max encode [OPTIONS]
```

**Example**

Basic embedding generation:

```sh
max encode \
  --model-path sentence-transformers/all-MiniLM-L6-v2 \
  --prompt "Convert this text into embeddings"
```

### `generate`

Performs text generation based on a provided prompt.
```sh
max generate [OPTIONS]
```

**Examples**

Text generation:

```sh
max generate \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 1024 \
  --max-new-tokens 100 \
  --prompt "Generate a story about a robot"
```

Text generation with controls:

```sh
max generate \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --max-length 1024 \
  --max-new-tokens 500 \
  --top-k 40 \
  --quantization-encoding q4_k \
  --cache-strategy paged \
  --prompt "Explain quantum computing"
```

Process an image using a vision-language model given a URL to an image:

**Llama 3.2 Vision**

Llama Vision models use special image and text-delimiter tokens in their prompts. For the exact prompt format, see the [Llama 3.2 Vision documentation](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/#-vision-model-inputs-and-outputs-).

```sh
max generate \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --prompt "What is in this image?" \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --max-new-tokens 100 \
  --max-batch-size 1 \
  --max-length 108172
```

**Pixtral**

Pixtral models take prompts with `[IMG]` tokens. For more information, see the [Pixtral documentation](https://huggingface.co/docs/transformers/en/model_doc/pixtral#pixtral).

```sh
max generate \
  --model-path mistral-community/pixtral-12b \
  --max-length 6491 \
  --image_url https://upload.wikimedia.org/wikipedia/commons/5/53/Almendro_en_flor_Sierras_de_Tejeda%2C_Almijara_y_Alhama.jpg \
  --prompt "[INST]Describe the images.\n[IMG][/INST]"
```

:::note
You can adjust parameters like `--max-batch-size` and `--max-length` depending on your system's available resources, such as GPU memory.
:::

For more information on how to use the `generate` command with vision models, see [Generate image descriptions with Llama 3.2 Vision](/max/tutorials/deploy-llama-vision).

### `list`

Displays available model architectures and configurations, including:

- Hugging Face model repositories
- Supported encoding types
- Available cache strategies

```sh
max list
```

### `serve`

Launches an OpenAI-compatible REST API server for production deployments. For more detail, see [the Serve API docs](/max/api/serve).

```sh
max serve [OPTIONS]
```

**Examples**

CPU serving:

```sh
max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
```

Optimized GPU serving:

```sh
max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu \
  --quantization-encoding bfloat16 \
  --max-batch-size 4 \
  --cache-strategy paged
```

Production setup:

```sh
max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --devices gpu:0,1 \
  --max-batch-size 8 \
  --device-memory-utilization 0.9
```

**Custom architectures**

The `max` CLI supports loading custom model architectures through the `--custom-architectures` flag. This allows you to extend MAX's capabilities with your own model implementations:

```sh
max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --custom-architectures path/to/module1:module1 \
  --custom-architectures path/to/module2:module2
```

:::note Custom architectures
The `--custom-architectures` flag allows you to load custom pipeline architectures from your own Python modules. Each module must define an `ARCHITECTURES` variable containing the architecture definitions. Each entry in `--custom-architectures` can be specified in two formats:

- A raw module name; for example: `my_module`.
- An import path followed by a colon and the module name; for example: `folder/path/to/import:my_module`.
The `ARCHITECTURES` variable in your module should be a list of implementations that conform to the [SupportedArchitecture](/max/api/python/pipelines/registry#max.pipelines.lib.registry.SupportedArchitecture) interface. These will be registered with the MAX pipeline registry automatically.
:::

### `warm-cache`

Preloads and compiles the model to optimize initialization time by:

- Pre-compiling models before deployment
- Warming up the Hugging Face cache

This command is useful to run before serving a model.

```sh
max warm-cache [OPTIONS]
```

**Example**

Basic cache warming:

```sh
max warm-cache \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
```

:::note
Although the Modular Executable Format (MEF) itself is designed to be platform independent, the serialized cache (MEF files) produced during compilation is platform-dependent. This is because:

- Platform-dependent optimizations happen during compilation.
- Fallback operations assume a particular runtime environment.

Weight transformations and hashing during MEF caching can impact performance. While efforts to improve this through weight externalization are ongoing, compiled MEF files remain platform-specific and are not generally portable.
:::

### Model configuration

Core settings for model loading and execution.

| Option | Description | Default | Values |
|---------------------------|------------------------------------|---------|---------------------------------------------------------------------------|
| `--custom-architectures` | Load custom pipeline architectures | | Module path format: `folder/path/to/import:my_module` |
| `--engine` | Backend engine | `max` | `max`\|`huggingface` |
| `--model-path TEXT` | (required) Path to model | | Any valid path or Hugging Face repo ID (e.g. `mistralai/Mistral-7B-v0.1`) |
| `--quantization-encoding` | Weight encoding type | | `float32`\|`bfloat16`\|`q4_k`\|`q4_0`\|`q6_k`\|`gptq` |
| `--weight-path PATH` | Custom model weights path | | Valid file path (supports multiple paths via repeated flags) |

:::note Quantization encoding
When using GGUF models, quantization encodings are detected automatically: if no `--quantization-encoding` is specified, MAX Serve uses the first encoding option in the repository. If you do specify an encoding, it must match one of the encoding options available in the repository; when a repository contains multiple quantization formats, use `--quantization-encoding` to select the one you want.
:::

### Device configuration

Controls hardware placement and memory usage.

| Option | Description | Default | Values |
|-------------------------------|-------------------------------|---------|-------------------------------------------------------------------|
| `--devices` | Target devices | | `cpu`\|`gpu`\|`gpu:{id}` (e.g. `gpu:0,1`) |
| `--device-specs` | Specific device configuration | `CPU` | `DeviceSpec` format (e.g. `DeviceSpec(id=-1, device_type='cpu')`) |
| `--device-memory-utilization` | Device memory fraction | `0.9` | Float between 0.0 and 1.0 |

### Performance tuning

Optimization settings for batch processing, caching, and sequence handling.
| Option | Description | Default | Values |
|------------------------|-------------------------------------|----------------------------------------------------|---------------------------------------------------------|
| `--cache-strategy` | Cache strategy | | `naive`\|`continuous`\|`paged` |
| `--kv-cache-page-size` | Token count per KVCache page | `128` | Positive integer |
| `--max-batch-size` | Maximum batch size | `1` | Positive integer |
| `--max-ce-batch-size` | Maximum context encoding batch size | `32` | Positive integer |
| `--max-length` | Maximum input sequence length | Model's default max length (from Hugging Face) | Positive integer (must be less than model's max config) |
| `--max-new-tokens` | Maximum tokens to generate | `-1` | Integer (-1 for model max) |
| `--pad-to-multiple-of` | Input tensor padding multiple | `2` | Positive integer |

### Model state control

Options for saving or loading model states and handling external code.

| Option | Description | Default | Values |
|-----------------------------------|--------------------------------|---------|-----------------|
| `--force-download` | Force re-download cached files | `false` | `true`\|`false` |
| `--trust-remote-code` | Allow custom Hugging Face code | `false` | `true`\|`false` |

### Generation parameters

Controls for generation behavior.

| Option | Description | Default | Values |
|---------------------------------|--------------------------------------------------------------------------------------|---------|------------------------------------------|
| `--enable-constrained-decoding` | Enable constrained generation | `false` | `true`\|`false` |
| `--enable-echo` | Enable model echo | `false` | `true`\|`false` |
| `--image_url` | URLs of images to include with prompt. Ignored if model doesn't support image inputs | `[]` | List of valid URLs |
| `--rope-type` | RoPE type for GGUF weights | | `none`\|`normal`\|`neox` |
| `--top-k` | Limit sampling to top K tokens | `1` | Positive integer (1 for greedy sampling) |

---

## MAX container

The MAX container is our official Docker container that simplifies deploying a GenAI model to an endpoint. The container includes the latest version of MAX and integrates with orchestration tools like Kubernetes.

Alternatively, you can experiment with MAX on a local endpoint using the [`max serve`](/max/max-cli#serve) command. The result is essentially the same, because the MAX container is a containerized environment that runs `max serve` to create the endpoint you can interact with using our OpenAI-compatible [REST API](/max/api/serve).

:::note Linux only
The MAX container is currently not compatible with macOS.
:::

## Get started

Here's how to start an endpoint with the MAX container:

1. Make sure you have [Docker installed](https://docs.docker.com/get-started/get-docker/).

2. Start the container and an endpoint for Llama 3.1:

   ```bash
   docker run --gpus=1 \
     -v ~/.cache/huggingface:/root/.cache/huggingface \
     -p 8000:8000 \
     docker.modular.com/modular/max-nvidia-full:latest \
     --model-path modularai/Llama-3.1-8B-Instruct-GGUF
   ```

   It can take a few minutes to pull the container and then download and compile the model. When the endpoint is ready, you'll see a message like this:

   ```output
   Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
   ```
3. Open a new terminal and send a request using the `openai` Python API or `curl`:

   1. Create a new virtual environment:

      ```sh
      mkdir quickstart && cd quickstart
      ```

      ```sh
      python3 -m venv .venv/quickstart \
        && source .venv/quickstart/bin/activate
      ```

   2. Install the OpenAI Python API:

      ```bash
      pip install openai
      ```

   3. Create the following file to send an inference request:

      ```python title="generate-text.py"
      from openai import OpenAI

      client = OpenAI(
          base_url="http://0.0.0.0:8000/v1",
          api_key="EMPTY",
      )

      completion = client.chat.completions.create(
          model="modularai/Llama-3.1-8B-Instruct-GGUF",
          messages=[
              {
                  "role": "user",
                  "content": "Who won the world series in 2020?"
              },
          ],
      )

      print(completion.choices[0].message.content)
      ```

   4. Run it and you should see results like this:

      ```sh
      python generate-text.py
      ```

      ```output
      The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
      ```

   Alternatively, send the same request with `curl`:

   ```sh
   curl -N http://0.0.0.0:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
       "stream": true,
       "messages": [
         {"role": "user", "content": "Who won the World Series in 2020?"}
       ]
     }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g'
   ```

   You should see results like this:

   ```output
   The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
   ```

For details about the OpenAI-compatible endpoint, see [our Serve API docs](/max/api/serve).

To run a different model, change the `--model-path` to something else from [our model repository](https://builds.modular.com/?category=models).

For information about the available containers, see the [Modular Docker Hub repositories](https://hub.docker.com/r/modular).

## Container options

The `docker run` command above includes the bare minimum options, but there are other `docker` options you might consider, plus several options to control features of the endpoint.

### Docker options

- `--gpus`: If your system includes a compatible GPU, you must add the [`--gpus` option](https://docs.docker.com/reference/cli/docker/container/run/#gpus) in order for the container to access it. It doesn't hurt to include this even if your system doesn't have a [GPU compatible with MAX](/max/faq#gpu-requirements).

- `--devices`: When deploying MAX on multiple GPUs, you must specify the IDs of the GPUs to use. For example, to use four available GPUs, include the following: `--devices gpu:0,1,2,3`. When you don't specify a `--devices` option, MAX defaults to using the first available GPU it discovers (equivalent to `--devices gpu:0`). You can also optionally specify `--devices cpu`.

- `-v`: We use the [`-v` option](https://docs.docker.com/reference/cli/docker/container/run/#volume) to save a cache of Hugging Face models to your local disk that we can reuse across containers.

- `-p`: We use the [`-p` option](https://docs.docker.com/reference/cli/docker/container/run/#publish) to specify the exposed port for the endpoint.

You also might need some environment variables (set with `--env`):

- `HF_TOKEN`: This is required to access gated models on Hugging Face (after your account is granted access).
  For example:

  ```sh
  docker run \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=" \
    -p 8000:8000 \
    docker.modular.com/modular/max-nvidia-full:latest \
    --model-path mistralai/Mistral-7B-Instruct-v0.2
  ```

  Learn more about [`HF_TOKEN`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hftoken) and how to create [Hugging Face access tokens](https://huggingface.co/docs/hub/en/security-tokens).

- `HF_HUB_ENABLE_HF_TRANSFER`: Set this to `1` to enable faster model downloads from Hugging Face. For example:

  ```sh
  docker run \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --env "HF_HUB_ENABLE_HF_TRANSFER=1" \
    docker.modular.com/modular/max-nvidia-full:latest \
    --model-path modularai/Llama-3.1-8B-Instruct-GGUF
  ```

  Learn more about [`HF_HUB_ENABLE_HF_TRANSFER`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hfhubenablehftransfer).

### MAX options

Following the container name in the `docker run` command, you must specify a model with `--model-path`, but there are other options you might need to configure the `max serve` behavior. To see all available options, see the [`max` CLI page](/max/max-cli#serve), because the MAX container is basically a wrapper around that tool.

- `--model-path`: This is required to specify the model you want to deploy. To find other GenAI models that are compatible with MAX, check out our [list of models on MAX Builds](https://builds.modular.com/?category=models).

- `--max-length`: Specifies the maximum length of the text sequence (including the input tokens). We mention this one here because it's often necessary to adjust the max length when you have trouble running a large model on a machine with limited memory.

For the rest of the `max serve` options, see the [`max` CLI page](/max/max-cli#serve).

## Container contents

The MAX container is based on the NVIDIA CUDA Deep Learning Container [version 12.5.0 base Ubuntu 22.04](https://hub.docker.com/layers/nvidia/cuda/12.5.0-base-ubuntu22.04/images/sha256-e58b22698c6f468de4dd32578d40821e30eae77251e18713ef986576d08ea825).

There are multiple MAX container options, including:

- [`max-nvidia-full`](/max/container#full-container)
- [`max-nvidia-base`](/max/container#base-container)

### Full container

The full MAX container (`max-nvidia-full`) includes MAX, as well as additional dependencies, such as PyTorch for GPU and cuDNN. The full MAX container is the default `max` CLI container and includes the following:

- Ubuntu 22.04
- Python 3.12
- MAX 25.3
- PyTorch (GPU) 2.6.0
- cuDNN
- CUDA 12.8
- NumPy
- Hugging Face Transformers

For more information on the full MAX container, see the [Docker Hub repository](https://hub.docker.com/r/modular/max-nvidia-full).

### Base container

The base MAX container (`max-nvidia-base`) includes only the essentials for deploying MAX, offering faster downloads and fewer dependencies. It only requires the NVIDIA Driver, instead of all of CUDA, resulting in a significantly smaller container. The base container includes:

- Ubuntu 22.04
- Python 3.12
- MAX 25.3
- PyTorch (CPU) 2.5
- CUDA 12.8 (requires the NVIDIA Driver only)
- NumPy
- Hugging Face Transformers

For more information on the base MAX container, see the [Docker Hub repository](https://hub.docker.com/r/modular/max-nvidia-base).
## Recommended cloud instances

For best performance and compatibility with the [available models on MAX Builds](https://builds.modular.com/?category=models), we recommend that you deploy the MAX container on a cloud instance with a GPU that meets the [MAX system requirements](/max/faq#system-requirements). The following are some cloud-based GPU instances and virtual machines that we recommend.

AWS instances:

- [P5](https://aws.amazon.com/ec2/instance-types/p5/) instance family (H100 GPU)
- [P4d](https://aws.amazon.com/ec2/instance-types/p4/) instance family (A100 GPU)
- [G5](https://aws.amazon.com/ec2/instance-types/g5/) instance family (A10G GPU)
- [G6](https://aws.amazon.com/ec2/instance-types/g6/) instance family (L4 GPU)
- [G6e](https://aws.amazon.com/ec2/instance-types/g6e/) instance family (L40S GPU)

GCP instances:

- [A3](https://cloud.google.com/compute/docs/gpus#a3-series) machine series (H100 GPU)
- [A2](https://cloud.google.com/compute/docs/gpus#a100-gpus) machine series (A100 GPU)
- [G2](https://cloud.google.com/compute/docs/gpus#l4-gpus) machine series (L4 GPU)

Azure instances:

- [NCads_H100_v5-series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nc-family#ncads_h100_v5-series) virtual machine
- [NCCads_H100_v5-series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nc-family#nccads_h100_v5-series) virtual machine
- [ND_H100_v5-series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-family#nd_h100_v5-series) virtual machine
- [NC_A100_v4-series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nc-family#nc_a100_v4-series) virtual machine
- [NDm_A100_v4-series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-family#ndm_a100_v4-series) virtual machine
- [ND_A100_v4-series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-family#nd_a100_v4-series) virtual machine
- [NVads-A10 v5-series](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nv-family#nvads-a10-v5-series) virtual machine

## Logs

The MAX container writes logs to stdout, which you can consume and view via your cloud provider's platform (for example, [with AWS CloudWatch](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html)).

Console log level is `INFO` by default. You can modify the log level using the `MAX_SERVE_LOGS_CONSOLE_LEVEL` environment variable. It accepts the following log levels (in order of increasing verbosity): `CRITICAL`, `ERROR`, `WARNING`, `INFO`, `DEBUG`. For example:

```bash
docker run \
  -e MAX_SERVE_LOGS_CONSOLE_LEVEL=DEBUG \
  docker.modular.com/modular/max-nvidia-full:latest \
  ...
```

For readability, logs default to unstructured text, but you can emit them as structured JSON by adding the `MODULAR_STRUCTURED_LOGGING=1` environment variable.

## Metrics

The MAX container exposes a `/metrics` endpoint that follows the [Prometheus](https://prometheus.io/docs/introduction/overview/) text format. You can scrape the metrics listed below using Prometheus or another collection service. These are raw metrics and it's up to you to compute the desired time series and aggregations. For example, we provide a count for output tokens (`maxserve_num_output_tokens_total`), which you can use to calculate the output tokens per second (OTP/s).
Here are all the available metrics:

- `maxserve_request_time_milliseconds`: Histogram of time spent handling each request (total inference time, or TIT), in milliseconds.
- `maxserve_input_processing_time_milliseconds`: Histogram of input processing time (IPT), in milliseconds.
- `maxserve_output_processing_time_milliseconds`: Histogram of output generation time (OGT), in milliseconds.
- `maxserve_time_to_first_token_milliseconds`: Histogram of time to first token (TTFT), in milliseconds.
- `maxserve_num_input_tokens_total`: Total number of input tokens processed so far.
- `maxserve_num_output_tokens_total`: Total number of output tokens processed so far.
- `maxserve_request_count_total`: Total requests since start.
- `maxserve_num_requests_running`: Number of requests currently running.

### Telemetry

In addition to sharing these metrics via the `/metrics` endpoint, the MAX container actively sends the metrics to Modular via push telemetry (using OpenTelemetry).

:::note
None of the telemetry includes personally identifiable information (PII).
:::

This telemetry is anonymous and helps us quickly identify problems and build better products for you. Without it, we would rely solely on user-submitted bug reports, which would severely limit our insight into real-world performance.

However, if you don't want to share this data with Modular, you can disable telemetry in your container. To disable telemetry, set the `MAX_SERVE_DISABLE_TELEMETRY` environment variable when you start your MAX container. For example:

```bash
docker run \
  -e MAX_SERVE_DISABLE_TELEMETRY=1 \
  docker.modular.com/modular/max-nvidia-full:latest \
  ...
```

#### Deployment and user ID

The telemetry is completely anonymous by default. But if you'd like to share some information that helps our team assist you in understanding your deployment's performance, you can add identity information to the telemetry with these environment variables:

- `MAX_SERVE_DEPLOYMENT_ID`: Your application name.
- `MODULAR_USER_ID`: Your company name.

For example:

```bash
docker run \
  -e MAX_SERVE_DEPLOYMENT_ID='Project name' \
  -e MODULAR_USER_ID='Example Inc.' \
  docker.modular.com/modular/max-nvidia-full:latest \
  ...
```

## License

The MAX container is released under the [NVIDIA Deep Learning Container license](https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf).

## Next steps

- [Local-to-cloud with MAX Serve](/max/tutorials/max-serve-local-to-cloud)
- [Deploy MAX Serve on Kubernetes](/max/tutorials/deploy-max-serve-on-kubernetes)

---

## max_finite

`max_finite[dtype: DType]() -> SIMD[dtype, 1]`

Returns the maximum finite value of type.

**Parameters:**

* dtype (`DType`): The value dtype.

**Returns:**

The maximum representable value of the type. Does not include infinity for floating-point types.

---

## max_int__

`max_int__(gpr: Int)`

UI16 matrix multiply.

---

## max_or_inf

`max_or_inf[dtype: DType]() -> SIMD[dtype, 1]`

Returns the maximum (potentially infinite) value of type.

**Parameters:**

* dtype (`DType`): The value dtype.

**Returns:**

The maximum representable value of the type. Can include infinity for floating-point types.

---

## max_pool

`max_pool[type: DType, int_type: DType, rank: Int = 4](input: NDBuffer[type, rank, origin], filter: NDBuffer[int_type, 1, origin], strides: NDBuffer[int_type, 1, origin], dilations: NDBuffer[int_type, 1, origin], paddings: NDBuffer[int_type, 1, origin], output: NDBuffer[type, rank, origin], ceil_mode: Bool = False)`

Computes max pooling.
**Args:**

* input (`NDBuffer[type, rank, origin]`): Batched image input to the pool2d operator.
* filter (`NDBuffer[int_type, 1, origin]`): Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w).
* strides (`NDBuffer[int_type, 1, origin]`): Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w).
* dilations (`NDBuffer[int_type, 1, origin]`): Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w).
* paddings (`NDBuffer[int_type, 1, origin]`): Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after).
* output (`NDBuffer[type, rank, origin]`): Pre-allocated output tensor space.
* ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding.

---

## max_pool_gpu

`max_pool_gpu[type: DType, int_type: DType, rank: Int = 4](ctx: DeviceContext, input: NDBuffer[type, rank, origin], filter: NDBuffer[int_type, 1, origin], strides: NDBuffer[int_type, 1, origin], dilations: NDBuffer[int_type, 1, origin], paddings: NDBuffer[int_type, 1, origin], output: NDBuffer[type, rank, origin], ceil_mode: Bool = False)`

Computes max pooling on GPU.

**Args:**

* ctx (`DeviceContext`): The DeviceContext to use for GPU execution.
* input (`NDBuffer[type, rank, origin]`): (On device) Batched image input to the pool2d operator.
* filter (`NDBuffer[int_type, 1, origin]`): (On host) Filter size on height and width dimensions with assumed tuple def (filter\_h, filter\_w).
* strides (`NDBuffer[int_type, 1, origin]`): (On host) Strides on height and width dimensions with assumed tuple def (stride\_h, stride\_w).
* dilations (`NDBuffer[int_type, 1, origin]`): (On host) Dilations on height and width dimensions with assumed tuple def (dilation\_h, dilation\_w).
* paddings (`NDBuffer[int_type, 1, origin]`): (On host) Paddings on height and width dimensions with assumed tuple def (pad\_h\_before, pad\_h\_after, pad\_w\_before, pad\_w\_after).
* output (`NDBuffer[type, rank, origin]`): (On device) Pre-allocated output tensor space.
* ceil\_mode (`Bool`): Ceiling mode defines the output shape and implicit padding.

---

## maybe_uninitialized

## Structs

* [​`UnsafeMaybeUninitialized`](/mojo/stdlib/memory/maybe_uninitialized/UnsafeMaybeUninitialized): A memory location that may or may not be initialized.

---

## mbarrier_arrive

`mbarrier_arrive[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]) -> Int`

Signal thread arrival at a shared memory barrier.

Records that the calling thread has reached the barrier synchronization point. Only supported on NVIDIA GPUs.

**Parameters:**

* type (`AnyType`): The data type stored at the barrier location.

**Args:**

* shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier.

**Returns:**

An integer representing the current state of the memory barrier.

---

## mbarrier_arrive_expect_tx_shared

`mbarrier_arrive_expect_tx_shared[type: AnyType](addr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], tx_count: SIMD[int32, 1])`

Configure a shared memory barrier to expect additional async transactions.

Updates the current phase of the memory barrier to track completion of additional asynchronous transactions. Only supported on NVIDIA GPUs.
**Parameters:** * ​type (`AnyType`): The type of the memory barrier. **Args:** * ​addr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​tx\_count (`SIMD[int32, 1]`): Number of expected transactions to track. --- ## mbarrier_init `mbarrier_init[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], num_threads: SIMD[int32, 1])` Initialize a shared memory barrier for synchronizing multiple threads. Sets up a memory barrier in shared memory that will be used to synchronize the specified number of threads. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location. **Args:** * ​shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory location for the barrier. * ​num\_threads (`SIMD[int32, 1]`): Number of threads that will synchronize on this barrier. --- ## mbarrier_test_wait `mbarrier_test_wait[type: AnyType](shared_mem: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], state: Int) -> Bool` Test if all threads have arrived at the memory barrier. Non-blocking check to see if all participating threads have reached the barrier. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The data type stored at the barrier location. **Args:** * ​shared\_mem (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​state (`Int`): Expected state of the memory barrier. **Returns:** True if all threads have arrived, False otherwise. --- ## mbarrier_try_wait_parity_shared `mbarrier_try_wait_parity_shared[type: AnyType](addr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], phase: SIMD[int32, 1], ticks: SIMD[int32, 1])` Wait for completion of a barrier phase with timeout. Waits for the shared memory barrier to complete the specified phase, or until the timeout period expires. Only supported on NVIDIA GPUs. **Parameters:** * ​type (`AnyType`): The type of the memory barrier. **Args:** * ​addr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the shared memory barrier. * ​phase (`SIMD[int32, 1]`): Phase number to wait for. * ​ticks (`SIMD[int32, 1]`): Timeout period in nanoseconds. --- ## mean `mean(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the mean value of the elements in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer of elements for which the mean is computed. **Returns:** The mean value of the elements in the given buffer. `mean[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the mean across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. 
`mean[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, output_shape: IndexList[size], context: DeviceContextPtr = DeviceContextPtr())`

Computes the mean across the input and output shape.

This performs the mean computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results' domain is `output_shape`, and the results are stored using the `output_fn`.

**Parameters:**

* type (`DType`): The type of the input and output.
* input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input.
* output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output.
* single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.
* target (`StringSlice[StaticConstantOrigin]`): The target to run on.

**Args:**

* input\_shape (`IndexList[size]`): The input shape.
* reduce\_dim (`Int`): The axis to perform the mean on.
* output\_shape (`IndexList[size]`): The output shape.
* context (`DeviceContextPtr`): The pointer to DeviceContext.

---

## memcmp

`memcmp[type: AnyType, address_space: AddressSpace](s1: UnsafePointer[type, address_space=address_space], s2: UnsafePointer[type, address_space=address_space], count: Int) -> Int`

Compares two buffers. Both buffers are assumed to be of the same length.

**Parameters:**

* type (`AnyType`): The element type.
* address\_space (`AddressSpace`): The address space of the pointer.

**Args:**

* s1 (`UnsafePointer[type, address_space=address_space]`): The first buffer address.
* s2 (`UnsafePointer[type, address_space=address_space]`): The second buffer address.
* count (`Int`): The number of elements in the buffers.

**Returns:**

Returns 0 if the buffers are identical, 1 if s1 > s2, and -1 if s1 < s2.

---

## memcpy

`memcpy[T: AnyType](dest: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], src: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], count: Int)`

Copies a memory area.

**Parameters:**

* T (`AnyType`): The element type.

**Args:**

* dest (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): The destination pointer.
* src (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): The source pointer.
* count (`Int`): The number of elements to copy.

A combined usage sketch for `memcmp`, `memcpy`, and `memset_zero` follows the module listings below.

---

## memcpy_or_fuse

`memcpy_or_fuse[rank: Int, type: DType, epilogue_fn: OptionalReg[fn[DType, Int, Int, Int](IndexList[$1], SIMD[$0, $2]) capturing -> None]](dest_data: UnsafePointer[SIMD[int8, 1]], out_byte_offset: Int, src_data: UnsafePointer[SIMD[int8, 1]], n: Int, out_shape: IndexList[rank, element_type=element_type])`

---

## memory

Implements `parallel_memcpy`.

You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import parallel_memcpy
```

## Functions

* [​`parallel_memcpy`](/mojo/stdlib/algorithm/memory/parallel_memcpy): Copies `count` elements from a memory buffer `src` to `dest` in parallel by spawning `num_tasks` tasks each copying `count_per_task` elements.

---

## memory

## Functions

* [​`clobber_memory`](/mojo/stdlib/benchmark/memory/clobber_memory): Forces all pending memory writes to be flushed to memory.
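The buffer functions above compose naturally. Here's a minimal usage sketch (illustrative only; the buffer size and values are arbitrary) that zero-fills a destination buffer, copies into it, and verifies the copy with `memcmp`:

```mojo
from memory import UnsafePointer, memcpy, memcmp, memset_zero

fn main():
    # Two small illustrative buffers of 4 Int elements each.
    var src = UnsafePointer[Int].alloc(4)
    var dst = UnsafePointer[Int].alloc(4)
    for i in range(4):
        src[i] = i * 10
    memset_zero(dst, 4)  # fill dst with zeros
    memcpy(dst, src, 4)  # copy 4 elements from src into dst
    # memcmp returns 0 when the buffers are identical.
    print(memcmp(dst, src, 4))  # => 0
    src.free()
    dst.free()
```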
--- ## memory This module provides GPU memory operations and utilities. The module implements low-level memory operations for GPU programming, with a focus on: * Memory address space abstractions (global, shared, constant) * Cache control operations and policies * Memory access patterns and optimizations * Memory alignment and pointer manipulation It provides a unified interface for memory operations across different GPU architectures, with specialized implementations for NVIDIA and AMD GPUs where needed. The module is designed for performance-critical code and requires careful usage to achieve optimal memory access patterns and cache utilization. ## Aliases ### `AddressSpace` `alias AddressSpace = _GPUAddressSpace` ## Structs * [​`CacheEviction`](/mojo/stdlib/gpu/memory/CacheEviction): Represents cache eviction policies for GPU memory operations. * [​`CacheOperation`](/mojo/stdlib/gpu/memory/CacheOperation): Represents different GPU cache operation policies. * [​`Consistency`](/mojo/stdlib/gpu/memory/Consistency): Represents memory consistency models for GPU memory operations. * [​`Fill`](/mojo/stdlib/gpu/memory/Fill): Represents memory fill patterns for GPU memory operations. * [​`ReduceOp`](/mojo/stdlib/gpu/memory/ReduceOp): Represents reduction operations for parallel reduction algorithms. ## Functions * [​`async_copy`](/mojo/stdlib/gpu/memory/async_copy): Asynchronously copies data from global memory to shared memory. * [​`async_copy_commit_group`](/mojo/stdlib/gpu/memory/async_copy_commit_group): Commits all prior initiated but uncommitted cp.async instructions into a cp.async-group. * [​`async_copy_wait_all`](/mojo/stdlib/gpu/memory/async_copy_wait_all): Waits for completion of all committed cp.async-groups. * [​`async_copy_wait_group`](/mojo/stdlib/gpu/memory/async_copy_wait_group): Waits for the completion of `n` most recently committed cp.async-groups. * [​`cp_async_bulk_tensor_global_shared_cta`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_global_shared_cta): Initiates an asynchronous copy operation to transfer tensor data from shared CTA memory to global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. * [​`cp_async_bulk_tensor_reduce`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_reduce): Initiates an asynchronous reduction operation between shared CTA memory and global memory using NVIDIA's Tensor Memory Access (TMA) mechanism. * [​`cp_async_bulk_tensor_shared_cluster_global`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_shared_cluster_global): Initiates an asynchronous bulk copy operation of tensor data from global memory to shared memory. * [​`cp_async_bulk_tensor_shared_cluster_global_multicast`](/mojo/stdlib/gpu/memory/cp_async_bulk_tensor_shared_cluster_global_multicast): Initiates an asynchronous multicast load operation using NVIDIA's Tensor Memory Access (TMA) to copy tensor data from global memory to shared memories of multiple CTAs in a cluster. * [​`external_memory`](/mojo/stdlib/gpu/memory/external_memory): Gets a pointer to dynamically allocated external memory. * [​`fence_mbarrier_init`](/mojo/stdlib/gpu/memory/fence_mbarrier_init): Creates a memory fence after mbarrier initialization. * [​`fence_proxy_tensormap_generic_sys_acquire`](/mojo/stdlib/gpu/memory/fence_proxy_tensormap_generic_sys_acquire): Acquires a system-wide memory fence for tensor map operations. * [​`fence_proxy_tensormap_generic_sys_release`](/mojo/stdlib/gpu/memory/fence_proxy_tensormap_generic_sys_release): Releases the system-wide memory fence for tensor map operations. 
* [​`load`](/mojo/stdlib/gpu/memory/load): Loads data from global memory into a SIMD vector.
* [​`multimem_ld_reduce`](/mojo/stdlib/gpu/memory/multimem_ld_reduce): Performs a vectorized load-reduce operation using NVIDIA's multimem feature.
* [​`multimem_st`](/mojo/stdlib/gpu/memory/multimem_st): Stages an inline multimem.st instruction.
* [​`tma_store_fence`](/mojo/stdlib/gpu/memory/tma_store_fence): Establishes a memory fence for shared memory stores in TMA operations.

---

## memory

The memory package provides several pointer types, as well as utility functions for dealing with memory.

## Modules

* [​`arc`](/mojo/stdlib/memory/arc/): Reference-counted smart pointers.
* [​`maybe_uninitialized`](/mojo/stdlib/memory/maybe_uninitialized/): Implements `UnsafeMaybeUninitialized`, a memory location that may or may not be initialized.
* [​`memory`](/mojo/stdlib/memory/memory/): Defines functions for memory manipulations.
* [​`owned_pointer`](/mojo/stdlib/memory/owned_pointer/): Implements `OwnedPointer`, a safe, single-ownership smart pointer.
* [​`pointer`](/mojo/stdlib/memory/pointer/): Implements the Pointer type.
* [​`span`](/mojo/stdlib/memory/span/): Implements the `Span` type.
* [​`unsafe`](/mojo/stdlib/memory/unsafe/): Provides utility functions for unsafe manipulation of SIMD values.
* [​`unsafe_pointer`](/mojo/stdlib/memory/unsafe_pointer/): Implements a generic unsafe pointer type.

---

## memory

Defines functions for memory manipulations.

You can import these APIs from the `memory` package. For example:

```mojo
from memory import memcmp
```

## Functions

* [​`memcmp`](/mojo/stdlib/memory/memory/memcmp): Compares two buffers. Both buffers are assumed to be of the same length.
* [​`memcpy`](/mojo/stdlib/memory/memory/memcpy): Copies a memory area.
* [​`memset`](/mojo/stdlib/memory/memory/memset): Fills memory with the given value.
* [​`memset_zero`](/mojo/stdlib/memory/memory/memset_zero): Fills memory with zeros.
* [​`stack_allocation`](/mojo/stdlib/memory/memory/stack_allocation): Allocates data buffer space on the stack given a data type and number of elements.

---

## MemoryElement

`struct MemoryElement[dtype: DType, layout: Layout, address_space: AddressSpace, alignment: Int, /, *, index_type: DType = _get_index_type(layout, address_space)]`

Represents data in memory organized according to a specific layout.

The `MemoryElement` struct provides a high-level interface for accessing data in memory with a specific layout. It encapsulates a pointer to the memory location and the runtime layout information needed to access the data correctly.

This abstraction enables efficient memory operations that respect the underlying memory organization, supporting vectorized loads and stores while handling different memory layouts transparently.

## Parameters

* dtype (`DType`): The data type of the elements.
* layout (`Layout`): The memory layout describing how elements are organized.
* address\_space (`AddressSpace`): The memory address space where the data is located.
* alignment (`Int`): The memory alignment requirement for the data.
* index\_type (`DType`): The integer type of the index pointing to each memory element.

## Fields

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment]`): Pointer to the memory location where the data is stored. This pointer provides access to the underlying memory with the specified address space and alignment requirements. It points to the first element of the data structure in memory.
* runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): Runtime layout information used for memory access calculations. This field stores the runtime layout information needed to compute memory offsets for accessing elements according to the specified layout pattern. It handles both compile-time known dimensions and runtime-determined dimensions.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment], runtime_layout: RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type])`

Initializes a `MemoryElement` with the given pointer and runtime layout.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment]`): Pointer to the memory location of the element.
* runtime\_layout (`RuntimeLayout[layout, element_type=int32, linear_idx_type=index_type]`): The runtime layout to use for memory access.

### `load`

`load(self, out result: Element[dtype, layout, index_type])`

Loads data from memory according to the specified layout.

This method performs a layout-aware load operation, reading data from memory following the access patterns defined by the layout. It optimizes memory reads based on the layout's stride patterns to maximize performance.

The method leverages the underlying `Element.load` implementation which handles different memory layout patterns including contiguous and strided access.

**Returns:**

An `Element` containing the loaded data organized according to the layout.

### `store`

`store(self, src: Element[dtype, layout, index_type])`

Stores element data to the memory location of this MemoryElement.

This method performs a layout-aware store operation, writing data to memory following the access patterns defined by the layout. It optimizes memory writes based on the layout's stride patterns to maximize performance.

The method delegates to the `Element.store` implementation which handles different memory layout patterns including vectorized stores for contiguous memory and element-by-element stores for non-contiguous layouts.

**Args:**

* src (`Element[dtype, layout, index_type]`): The `Element` containing the data to store.

### `transfer`

`transfer(self, src: MemoryElement[dtype, layout, address_space, alignment, index_type=index_type])`

Transfers data from another `MemoryElement` to this one.

This method efficiently transfers data between memory locations with potentially different layouts and data types. It performs the following operations:

1. Loads data from the source `MemoryElement` using its layout
2. Converts the data to the destination data type if necessary
3. Stores the converted data to the destination memory location using its layout

This provides a high-performance way to copy and convert data between different memory representations while respecting both source and destination memory layouts.

**Args:**

* src (`MemoryElement[dtype, layout, address_space, alignment, index_type=index_type]`): The source `MemoryElement` to transfer data from.

---

## memset

`memset[type: AnyType, address_space: AddressSpace](ptr: UnsafePointer[type, address_space=address_space], value: SIMD[uint8, 1], count: Int)`

Fills memory with the given value.

**Parameters:**

* type (`AnyType`): The element type.
* address\_space (`AddressSpace`): The address space of the pointer.
**Args:** * ​ptr (`UnsafePointer[type, address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. * ​value (`SIMD[uint8, 1]`): The value to fill with. * ​count (`Int`): Number of elements to fill (in elements, not bytes). --- ## memset_zero `memset_zero[type: AnyType, address_space: AddressSpace, //](ptr: UnsafePointer[type, address_space=address_space], count: Int)` Fills memory with zeros. **Parameters:** * ​type (`AnyType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. **Args:** * ​ptr (`UnsafePointer[type, address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. * ​count (`Int`): Number of elements to fill (in elements, not bytes). `memset_zero[dtype: DType, address_space: AddressSpace, //, *, count: Int](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space])` Fills memory with zeros. **Parameters:** * ​dtype (`DType`): The element type. * ​address\_space (`AddressSpace`): The address space of the pointer. * ​count (`Int`): Number of elements to fill (in elements, not bytes). **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space]`): UnsafePointer to the beginning of the memory block to fill. --- ## merge `merge[type: DType, out_idx_type: DType, rank: Int](mut buf_keys: NDBuffer[type, rank, origin], mut buf_ids: NDBuffer[out_idx_type, rank, origin], start: Int, mid: Int, end: Int)` Merge two sorted subarrays into one sorted array. --- ## merge_sort_recursive `merge_sort_recursive[type: DType, out_idx_type: DType, rank: Int](mut buf_keys: NDBuffer[type, rank, origin], mut buf_ids: NDBuffer[out_idx_type, rank, origin], start: Int, end: Int)` Recursive merge sort implementation. --- ## mha ## Functions * [​`flash_attention`](./flash_attention): * [​`flash_attention_dispatch`](./flash_attention_dispatch): * [​`flash_attention_hw_supported`](./flash_attention_hw_supported): * [​`get_mha_decoding_num_partitions`](./get_mha_decoding_num_partitions): * [​`managed_tensor_slice_to_ndbuffer`](./managed_tensor_slice_to_ndbuffer): * [​`mha`](./mha): * [​`mha_decoding`](./mha_decoding): * [​`mha_decoding_single_batch`](./mha_decoding_single_batch): Flash attention v2 algorithm. * [​`mha_decoding_single_batch_pipelined`](./mha_decoding_single_batch_pipelined): Flash attention v2 algorithm. * [​`mha_gpu_naive`](./mha_gpu_naive): * [​`mha_single_batch`](./mha_single_batch): MHA for token gen where seqlen = 1 and num\_keys >= 1. * [​`mha_single_batch_pipelined`](./mha_single_batch_pipelined): MHA for token gen where seqlen = 1 and num\_keys >= 1. * [​`mha_splitk_reduce`](./mha_splitk_reduce): * [​`scale_and_mask_helper`](./scale_and_mask_helper): --- ## mha `mha[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False, ragged: Bool = False, is_shared_kv: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], batch_size: Int, seq_len_arg: Int, num_keys_arg: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], mask: mask_t, score_mod: score_mod_t)` --- ## mha_cross ## Functions * [​`mha_cross_gpu_naive`](./mha_cross_gpu_naive): Naive cross attention on GPU. 
---

## mha_cross_gpu_naive

`mha_cross_gpu_naive[cache_t: KVCacheT, mask_t: MHAMask, type: DType, q_shape: DimList, //, rank: Int](output: NDBuffer[type, rank, MutableAnyOrigin, shape, strides], q: NDBuffer[type, rank, MutableAnyOrigin, q_shape, strides], q_input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides], q_max_seq_len: Int, k: cache_t, v: cache_t, kv_input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides], mask_functor: mask_t, scale: SIMD[float32, 1], ctx: DeviceContext)`

Naive cross attention on GPU. Note that this assumes ragged tensor inputs and uses a mask functor.

Computes:

1. Transpose (Q) BSHD -> BHSD;
2. Transpose (K) BSHD -> BHSD;
3. Transpose (V) BSHD -> BHSD;
4. P = Bmm(Q, K), P is also called "score";
5. P = P \* scale + mask;
6. P = softmax(P);
7. O = Bmm(P, V);
8. Output = Transpose(O).

B, S, H, D denote batch size, sequence length, head count, and depth, respectively. Steps (1), (2), and (3) happen while loading the data into shared memory; step (8) happens when writing output to global memory.

All inputs (query, key, and value) must have BSHD layout. The mask can be BSS or BHSS.

This kernel also handles the grouped attention optimization. In this case the shapes of K and V are BShD, where h = H / num\_groups.

---

## mha_decoding

`mha_decoding[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, ragged: Bool = False, is_shared_kv: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], batch_size: Int, num_partitions: Int, max_cache_valid_length: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], mask: mask_t, score_mod: score_mod_t)`

---

## mha_decoding_single_batch

`mha_decoding_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)`

Flash attention v2 algorithm.
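For reference, the kernels in this group, from the naive cross-attention kernel to the flash-attention decoding variants, compute the same quantity with different memory scheduling. In the notation of the naive kernel's steps (4) through (7):

$$O = \operatorname{softmax}\left(\text{scale} \cdot Q K^{\top} + \text{mask}\right) V$$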
--- ## mha_decoding_single_batch_pipelined `mha_decoding_single_batch_pipelined[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` Flash attention v2 algorithm. --- ## mha_gpu_naive `mha_gpu_naive[output_type: DType, k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, rank: Int, //, ragged: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False](q: NDBuffer[type, rank, origin, shape, strides], k: k_t, v: v_t, mask_functor: mask_t, output: NDBuffer[output_type, rank, origin, shape, strides], valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], batch_size: Int, max_prompt_len: Int, max_cache_size: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` `mha_gpu_naive[q_type: DType, k_type: DType, v_type: DType, output_type: DType, rank: Int, mask_type: DType, mask_rank: Int, //](q: NDBuffer[q_type, rank, origin, shape, strides], k: NDBuffer[k_type, rank, origin, shape, strides], v: NDBuffer[v_type, rank, origin, shape, strides], mask: NDBuffer[mask_type, mask_rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], output: NDBuffer[output_type, rank, origin, shape, strides], scale: SIMD[float32, 1], batch_size: Int, seq_len: Int, num_keys: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` `mha_gpu_naive[q_type: DType, output_type: DType, cache_t: KVCacheT, mask_t: MHAMask, rank: Int, //, ragged: Bool = False](q: NDBuffer[q_type, rank, origin, shape, strides], k: cache_t, v: cache_t, mask_functor: mask_t, output: NDBuffer[output_type, rank, origin, shape, strides], valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], scale: SIMD[float32, 1], batch_size: Int, max_prompt_len: Int, max_cache_size: Int, num_heads: Int, depth: Int, group: Int, ctx: DeviceContext)` --- ## mha_mask ## Aliases ### `MASK_VALUE` `alias MASK_VALUE = -10000` ## Structs * [​`AndMask`](./AndMask): Mask that's the AND of two masks. * [​`CausalMask`](./CausalMask): MHA causal mask ensures a token is only affected by previous tokens. * [​`ChunkedMask`](./ChunkedMask): Mask implementing Chunked attention. * [​`MaskName`](./MaskName): A tile's masking status. * [​`MaterializedMask`](./MaterializedMask): Mask that's backed by a materialized tensor. * [​`NullMask`](./NullMask): Mask that's effectively a noop. * [​`OrMask`](./OrMask): Mask that's the OR of two masks. * [​`SlidingWindowCausalMask`](./SlidingWindowCausalMask): Mask implementing Sliding Window attention. * [​`TileMaskStatus`](./TileMaskStatus): A tile's masking status. ## Traits * [​`MHAMask`](./MHAMask): The MHAMask trait describes masks for MHA kernels, such as the causal mask. ## Functions * [​`ChunkedCausalMask`](./ChunkedCausalMask): Mask implementing Chunked Causal attention for Llama4 models. 
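To make the causal-mask semantics above concrete, here's a standalone sketch (a hypothetical helper, not the library's `MHAMask` trait API): a query token at position q may attend only to keys at positions k <= q, and masked positions receive a large negative value like the `MASK_VALUE` alias documented above.

```mojo
# Standalone illustration of causal masking. `apply_causal_mask` is
# hypothetical; the real kernels apply masks through the MHAMask trait.
alias MASK_VALUE = -10000

fn apply_causal_mask(score: Float32, q_idx: Int, k_idx: Int) -> Float32:
    # A token may attend only to itself and earlier tokens.
    if k_idx > q_idx:
        return Float32(MASK_VALUE)
    return score

fn main():
    print(apply_causal_mask(0.5, 2, 1))  # 0.5: k <= q, score passes through
    print(apply_causal_mask(0.5, 1, 2))  # -10000.0: k > q, masked out
```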
--- ## mha_operand ## Structs * [​`KVCacheMHAOperand`](./KVCacheMHAOperand): An implementation for `mo.opaque` KVCacheT arguments to MHA kernels. * [​`NDBufferMHAOperand`](./NDBufferMHAOperand): An implementation for NDBuffer arguments to MHA kernels. * [​`RaggedMHAOperand`](./RaggedMHAOperand): An implementation for ragged NDBuffer arguments to MHA kernels. ## Traits * [​`MHAOperand`](./MHAOperand): This serves as the trait to support arguments to our MHA kernel. --- ## mha_score_mod ## Structs * [​`AlibiScoreMod`](./AlibiScoreMod): AlibiScoreMod adds the appropriate ALiBi constant bias to the attention score. * [​`IdentityScoreMod`](./IdentityScoreMod): IdentityScoreMod simply returns the attention score unchanged. ## Traits * [​`ScoreModTrait`](./ScoreModTrait): The ScoreModTrait describes the score\_mod for MHA kernels, such as the ALiBi bias. --- ## mha_single_batch `mha_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], num_keys: Int, mask_tensor_col: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MHA for token gen where seqlen = 1 and num\_keys >= 1. The general data layout and steps conform to flash attention, with two exceptions:

1. Partitioning is across B, H, and num\_keys (TODO). The last one is split-K and will need a separate reduction kernel at the end.
2. The first bmm becomes a gemv and the second bmm becomes a gevm.

TODO: use more optimized kernels for them. --- ## mha_single_batch_pipelined `mha_single_batch_pipelined[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, use_score_mod: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], num_keys: Int, mask_tensor_col: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MHA for token gen where seqlen = 1 and num\_keys >= 1. The general data layout and steps conform to flash attention, with two exceptions:

1. Partitioning is across B, H, and num\_keys (TODO). The last one is split-K and will need a separate reduction kernel at the end.
2. The first bmm becomes a gemv and the second bmm becomes a gevm.

TODO: use more optimized kernels for them. --- ## mha_sm90 ## Structs * [​`DynamicInt`](./DynamicInt): * [​`MHAPosition`](./MHAPosition): Position of the MHA-kernel. When `decoding=False`, `q_head_stride == num_heads`. When `decoding=True`, `q_head_stride == 1`.
* [​`NoPartition`](./NoPartition): * [​`SplitKPartition`](./SplitKPartition): * [​`StaticInt`](./StaticInt): ## Traits * [​`MHAPartitionScheme`](./MHAPartitionScheme): * [​`OptionallyStaticInt`](./OptionallyStaticInt): ## Functions * [​`mha_sm90_dispatch`](./mha_sm90_dispatch): * [​`valid_length_managed_tensor_slice_to_ndbuffer`](./valid_length_managed_tensor_slice_to_ndbuffer): --- ## mha_sm90_dispatch `mha_sm90_dispatch[k_t: MHAOperand, v_t: MHAOperand, mask_t: MHAMask, score_mod_t: ScoreModTrait, type: DType, output_type: DType, max_prompt_len_t: OptionallyStaticInt, partition_t: MHAPartitionScheme, //, config: MHAConfig, group: Int, use_score_mod: Bool, ragged: Bool, _is_cache_length_accurate: Bool](output: UnsafePointer[SIMD[output_type, 1]], q: UnsafePointer[SIMD[type, 1]], k: k_t, v: v_t, mask_functor: mask_t, score_mod_functor: score_mod_t, valid_length: ManagedTensorSlice[io_spec, static_spec=static_spec], max_prompt_len_arg: max_prompt_len_t, max_cache_valid_length_arg: Int, scale: SIMD[float32, 1], kv_input_row_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], batch_size_arg: Int, partition: partition_t, ctx: DeviceContext)` --- ## mha_splitk_reduce `mha_splitk_reduce[output_type: DType, depth: UInt, num_heads: UInt, num_threads: UInt, group: UInt = UInt(1), use_exp2: Bool = False](intermediate_ptr: UnsafePointer[SIMD[output_type, 1]], output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], batch_size: Int, num_partitions: Int)` --- ## mha_tile_scheduler ## Structs * [​`MHASchedule`](./MHASchedule): * [​`MHASchedulerSynchronization`](./MHASchedulerSynchronization): * [​`MHATileState`](./MHATileState): * [​`MHATileSummary`](./MHATileSummary): * [​`QueuedTileScheduler`](./QueuedTileScheduler): If `decoding == False`, then `num_heads` is `q_num_heads`. If `decoding == True`, then `num_heads` is `kv_num_heads`. * [​`SeqInfo`](./SeqInfo): * [​`TileScheduler`](./TileScheduler): * [​`TransientScheduler`](./TransientScheduler): * [​`WorkInfo`](./WorkInfo): ## Traits * [​`MHATileScheduler`](./MHATileScheduler): --- ## mha_utils ## Aliases ### `callback_fn_type` `alias callback_fn_type = fn[MHAMask, ScoreModTrait](mask: $0, score_mod: $1) raises capturing -> None` ### `is_sm90` `alias is_sm90 = _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90"))` ## Structs * [​`FlashAttentionAlgorithm`](./FlashAttentionAlgorithm): * [​`MHAConfig`](./MHAConfig): ## Functions * [​`dispatch_mask_and_score_mod`](./dispatch_mask_and_score_mod): * [​`dispatch_materialized_mask_and_score_mod`](./dispatch_materialized_mask_and_score_mod): * [​`get_start_and_end_for_partitions`](./get_start_and_end_for_partitions): Calculate start and end indices for a partition. 
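To make the split-K partitioning arithmetic behind `get_start_and_end_for_partitions` concrete, here is a hedged sketch: divide `num_keys` into contiguous per-partition ranges, rounded up to the `BN` tile size so no tile straddles two partitions. The helper name and the exact rounding scheme are assumptions for illustration; the kernel's actual scheme may differ.

```mojo
# Hypothetical helper: split num_keys into num_partitions contiguous ranges,
# rounded up to the BN tile size so no tile straddles two partitions.
fn partition_start_end(
    partition_idx: Int, num_partitions: Int, num_keys: Int, BN: Int
) -> Tuple[Int, Int]:
    var keys_per_part = (num_keys + num_partitions - 1) // num_partitions
    keys_per_part = (keys_per_part + BN - 1) // BN * BN
    var start = min(partition_idx * keys_per_part, num_keys)
    var end = min(start + keys_per_part, num_keys)
    return (start, end)

fn main():
    # 4 partitions over 1000 keys with BN = 64 gives
    # (0, 256), (256, 512), (512, 768), (768, 1000).
    for p in range(4):
        var r = partition_start_end(p, 4, 1000, 64)
        print(r[0], r[1])
```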
--- ## MHAConfig `@register_passable(trivial)` `struct MHAConfig` ## Fields * ​type (`DType`): * ​num\_heads (`UInt`): * ​depth (`UInt`): * ​num\_queries\_per\_block (`UInt`): * ​num\_keys\_per\_block (`UInt`): * ​BK (`UInt`): * ​WM (`UInt`): * ​WN (`UInt`): * ​num\_pipeline\_stages (`UInt`): * ​k\_group\_size (`UInt`): * ​algorithm (`FlashAttentionAlgorithm`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(type: DType, num_heads: UInt, depth: UInt, num_queries_per_block: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), num_keys_per_block: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), BK: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), WM: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), WN: OptionalReg[UInt] = OptionalReg[UInt]({:i1 0, 1}), num_pipeline_stages: UInt = UInt(2 if _accelerator_arch().__contains__[::Bool,::Origin[$2]](__init__[__mlir_type.!kgen.string](":90")) else 4), k_group_size: UInt = UInt(1), algorithm: FlashAttentionAlgorithm = FlashAttentionAlgorithm()) -> Self` ### `block_m` `block_m(self) -> UInt` ### `block_n` `block_n(self) -> UInt` ### `block_k` `block_k(self) -> UInt` ### `warp_m` `warp_m(self) -> UInt` ### `warp_n` `warp_n(self) -> UInt` ### `num_warps_m` `num_warps_m(self) -> UInt` ### `num_warps_n` `num_warps_n(self) -> UInt` ### `num_consumer_threads` `num_consumer_threads(self) -> UInt` ### `num_producer_threads` `num_producer_threads[producer_consumer_kernel: Bool = False](self) -> UInt` ### `num_threads` `num_threads[producer_consumer_kernel: Bool = False](self) -> UInt` ### `q_smem_size` `q_smem_size(self, sm_90: Bool = False) -> UInt` ### `kv_smem_size` `kv_smem_size(self, sm_90: Bool = False) -> UInt` ### `k_smem_size` `k_smem_size(self, sm_90: Bool = False) -> UInt` ### `v_smem_size` `v_smem_size(self, sm_90: Bool = False) -> UInt` ### `p_smem_size` `p_smem_size(self) -> UInt` ### `warp_scratch_smem_size` `warp_scratch_smem_size(self) -> UInt` ### `shared_mem_bytes` `shared_mem_bytes[shared_kv: Bool = False, sm_90: Bool = False](self) -> UInt` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## MHAMask The MHAMask trait describes masks for MHA kernels, such as the causal mask. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask` Does the mask require `log2e` to be applied after the mask, or can it be fused with the scaling? ### `mask_out_of_bound` `alias mask_out_of_bound` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds` Is the mask safe to read out of bounds? ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self: _Self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` Returns the mask vector at the given coordinates.

Arguments:

* ​`coord` is (seq\_id, head, q\_idx, k\_idx).
* ​`score_vec` is the vector at `coord` of the score matrix.

The functor could capture a mask tensor and add it to the score, e.g., for Replit. ### `status` `status[*, element_type: DType = uint32](self: _Self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` Given a tile's index range, return its masking status. --- ## MHAOperand This serves as the trait to support arguments to our MHA kernel.
## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `type` `alias type` ## Methods ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self: _Self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "type"), 1]]` ### `cache_length` `cache_length(self: _Self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `max_context_length` `max_context_length(self: _Self) -> SIMD[uint32, 1]` Returns the maximum cache length across all sequences in the batch. --- ## MHAPartitionScheme ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `accum_dtype` `alias accum_dtype` ### `do_partition` `alias do_partition` ## Methods ### `num_partitions` `num_partitions(self: _Self) -> SIMD[uint32, 1]` ### `get_exp_sum_qk_max_pointer` `get_exp_sum_qk_max_pointer(self: _Self) -> UnsafePointer[SIMD[get_vtable_entry(:trait _Self, "accum_dtype"), 1]]` --- ## MHAPosition `@register_passable(trivial)` `struct MHAPosition[BM: Int, BN: Int, depth: Int, num_heads: Int, group: Int, decoding: Bool]` Position of the MHA-kernel. When `decoding=False`, `q_head_stride == num_heads`. When `decoding=True`, `q_head_stride == 1`. ## Fields * ​q\_out\_offset (`Int`): * ​num\_keys (`SIMD[uint32, 1]`): * ​start\_pos (`SIMD[uint32, 1]`): * ​seq\_len (`SIMD[uint32, 1]`): * ​head\_idx (`SIMD[uint32, 1]`): * ​prompt\_offset (`SIMD[uint32, 1]`): * ​prompt\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `q_output_gmem_layout` `alias q_output_gmem_layout = __init__[::Origin[::Bool(IntTuple(BM, depth), IntTuple(depth if decoding else (depth * num_heads), 1))` ### `q_stride` `alias q_stride = depth if decoding else (depth * num_heads)` ## Methods ### `__init__` `__init__(q_out_offset: Int, num_keys: SIMD[uint32, 1], start_pos: SIMD[uint32, 1], seq_info: SeqInfo) -> Self` ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `q_head_idx` `q_head_idx(self) -> SIMD[uint32, 1]` ### `kv_head_idx` `kv_head_idx(self) -> SIMD[uint32, 1]` ### `write_to` `write_to[W: Writer](self, mut writer: W)` ### `q_tile_num_rows` `q_tile_num_rows(self) -> SIMD[uint32, 1]` ### `q_out_gmem_tensor` `q_out_gmem_tensor[dtype: DType](self, ptr: UnsafePointer[SIMD[dtype, 1]]) -> LayoutTensor[dtype, __init__[::Origin[::Bool(IntTuple(BM, depth), IntTuple(depth if decoding else (depth * num_heads), 1)), MutableAnyOrigin, layout_int_type=int32, linear_idx_type=int32, masked=True]` ### `mask_status` `mask_status[mask_t: MHAMask](self, mask: mask_t, kv_tile_start_row: SIMD[uint32, 1]) -> TileMaskStatus` ### `exp_sum_qk_max_ptr` `exp_sum_qk_max_ptr[partition_t: MHAPartitionScheme](self, partition: partition_t, batch_size: SIMD[uint32, 1]) -> Tuple[UnsafePointer[SIMD[get_vtable_entry(:trait partition_t, "accum_dtype"), 1]], UnsafePointer[SIMD[get_vtable_entry(:trait partition_t, "accum_dtype"), 1]]]` ### `get_start_and_end_for_partitions` `get_start_and_end_for_partitions[partition_t: MHAPartitionScheme, //, BN: Int](self, partition: partition_t) -> Tuple[SIMD[uint32, 1], SIMD[uint32, 1]]` --- ## MHASchedule `@register_passable(trivial)` `struct MHASchedule` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `DEFAULT` `alias DEFAULT = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))` ### `PROMPT_ROTATE` `alias PROMPT_ROTATE = MHASchedule(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## MHASchedulerSynchronization `@register_passable(trivial)` `struct MHASchedulerSynchronization` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `ALL` `alias ALL = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](2))` ### `DEFAULT` `alias DEFAULT = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))` ### `NONE` `alias NONE = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](0))` ### `PRODUCER` `alias PRODUCER = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` --- ## MHATileScheduler The MHATileScheduler trait describes a schedule for the persistent kernel. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance` ### `mha_schedule` `alias mha_schedule` ## Methods ### `get_current_work_info` `get_current_work_info(self: _Self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` Returns the current `WorkInfo`. ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self: _Self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` Advances state to the next work item. Returns the `SeqInfo` for the next work item if there is more work, otherwise an empty optional. ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` Return the grid\_dim required for the kernel. ### `initial_state` `initial_state(self: _Self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` Create the initial state object.
### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self: _Self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## MHATileState `@register_passable(trivial)` `struct MHATileState` ## Fields * ​idx (`SIMD[uint32, 1]`): * ​sidx\_ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)]`): * ​max\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(idx: SIMD[uint32, 1], sidx_ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], max_idx: SIMD[uint32, 1]) -> Self` ### `is_valid` `is_valid(self, idx: SIMD[uint32, 1]) -> Bool` `is_valid(self) -> Bool` --- ## MHATileSummary `@register_passable(trivial)` `struct MHATileSummary` ## Fields * ​batch\_size (`SIMD[uint32, 1]`): * ​max\_num\_prompt\_tiles (`SIMD[uint32, 1]`): * ​valid\_length (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​max\_seq\_len (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1], valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_len: SIMD[uint32, 1]) -> Self` ### `get_current_work_info` `get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> WorkInfo` `get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: MHATileState) -> WorkInfo` ### `unsafe_get_current_work_info` `unsafe_get_current_work_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> WorkInfo` ### `max_idx` `max_idx(self, num_heads: SIMD[uint32, 1]) -> SIMD[uint32, 1]` ### `grid_dim` `static grid_dim[num_heads: SIMD[uint32, 1]](max_num_prompt_tiles: SIMD[uint32, 1], batch_size: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `seq_info` `seq_info[ragged: Bool](self, work: WorkInfo) -> SeqInfo` ### `unsafe_seq_info` `unsafe_seq_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], ragged: Bool, schedule: MHASchedule](self, idx: SIMD[uint32, 1]) -> SeqInfo` `unsafe_seq_info[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], ragged: Bool, schedule: MHASchedule](self, state: MHATileState) -> SeqInfo` --- ## MicroKernelShape `@register_passable(trivial)` `struct MicroKernelShape` Record describing the inner kernel shape. ## Fields * ​simd\_rows (`Int`): * ​simd\_cols (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(rows: Int, cols: Int) -> Self` --- ## min `min(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the min element in a buffer. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The minimum of the buffer elements. `min[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the min across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. 
`min[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the min across the input and output shape. This performs the min computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the min on. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## min `min(x: Int, y: Int, /) -> Int` Gets the minimum of two integers. **Args:** * ​x (`Int`): Integer input to min. * ​y (`Int`): Integer input to min. **Returns:** Minimum of x and y. `min(x: UInt, y: UInt, /) -> UInt` Gets the minimum of two integers. **Args:** * ​x (`UInt`): Integer input to min. * ​y (`UInt`): Integer input to min. **Returns:** Minimum of x and y. `min[dtype: DType, //](x: SIMD[dtype, size], y: SIMD[dtype, size], /) -> SIMD[dtype, size]` Gets the elementwise minimum of x and y. An element of the result SIMD vector will be the minimum of the corresponding elements in x and y. **Constraints:** The type of the inputs must be numeric or boolean. **Parameters:** * ​dtype (`DType`): The data type of the SIMD vector. **Args:** * ​x (`SIMD[dtype, size]`): First SIMD vector. * ​y (`SIMD[dtype, size]`): Second SIMD vector. **Returns:** A SIMD vector containing the elementwise minimum of x and y. `min[T: Copyable & LessThanComparable](x: T, *ys: T) -> T` Gets the minimum value from a sequence of values. **Parameters:** * ​T (`Copyable & LessThanComparable`): A type that is both copyable and comparable with less than. **Args:** * ​x (`T`): The first value to compare. * ​\*ys (`T`): Zero or more additional values to compare. **Returns:** The minimum value from the input sequence. --- ## min `min[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the minimum value across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global minimum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final minimum is broadcast to all threads in the block. If False, only the first thread will have the complete min. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to find the minimum. **Returns:** If broadcast is True, each thread in the block will receive the minimum value across the entire block. 
Otherwise, only the first thread will have the complete result. --- ## min `min[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the minimum value across all lanes in a warp. This is a convenience wrapper around lane\_group\_min that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global minimum value across all lanes in the warp. **Parameters:** * ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in the SIMD vector. **Args:** * ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to find the minimum. **Returns:** A SIMD value where all lanes contain the minimum value found across the entire warp. The minimum value is broadcast to all lanes. --- ## min_finite `min_finite[dtype: DType]() -> SIMD[dtype, 1]` Returns the minimum (lowest) finite value of the type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The minimum representable value of the type. Does not include negative infinity for floating-point types. --- ## min_or_neg_inf `min_or_neg_inf[dtype: DType]() -> SIMD[dtype, 1]` Returns the minimum (potentially negative infinite) value of the type. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The minimum representable value of the type. Can include negative infinity for floating-point types. --- ## min_p_sampling `min_p_sampling[type: DType, rank: Int, out_idx_type: DType, //, _test_sort: Bool = False](min_ps: NDBuffer[type, 1, origin], input_logits: NDBuffer[type, rank, origin], out_token_ids: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` Naive CPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the calculated probability threshold (Min-P). --- ## min_p_sampling_gpu `min_p_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //, _test_sort: Bool = False](ctx: DeviceContext, min_ps: NDBuffer[type, 1, origin], input_logits: NDBuffer[type, rank, origin], out_token_ids: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))` GPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the calculated probability threshold (Min-P). --- ## mkdir `mkdir[PathLike: PathLike](path: PathLike, mode: Int = 511)` Creates a directory at the specified path. If the directory cannot be created, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from the current working directory. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to the directory. * ​mode (`Int`): The mode to create the directory with. --- ## mkdtemp `mkdtemp(suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None)) -> String` Create a temporary directory. Caller is responsible for deleting the directory when done with it. **Args:** * ​suffix (`String`): Suffix to use for the directory name. * ​prefix (`String`): Prefix to use for the directory name. * ​dir (`Optional[String]`): Directory in which the directory will be created.
**Returns:** The name of the created directory. **Raises:** If the directory cannot be created. --- ## mla ## Functions * [​`flare_mla_decoding`](./flare_mla_decoding): MLA decoding kernel that would only be called in the optimized compute graph. * [​`flare_mla_decoding_dispatch`](./flare_mla_decoding_dispatch): * [​`flare_mla_prefill`](./flare_mla_prefill): MLA prefill kernel that would only be called in the optimized compute graph. Only supports ragged Q/K/V inputs. * [​`flare_mla_prefill_dispatch`](./flare_mla_prefill_dispatch): * [​`mla_decoding`](./mla_decoding): * [​`mla_decoding_single_batch`](./mla_decoding_single_batch): Flash attention v2 algorithm. * [​`mla_prefill`](./mla_prefill): * [​`mla_prefill_plan`](./mla_prefill_plan): This calls a GPU kernel that plans how to process a batch of sequences with varying lengths using a fixed-size buffer. * [​`mla_prefill_plan_kernel`](./mla_prefill_plan_kernel): * [​`mla_prefill_single_batch`](./mla_prefill_single_batch): MLA for encoding where seqlen > 1. --- ## mla_decoding `mla_decoding[q_type: DType, k_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, ragged: Bool = False, _use_valid_length: Bool = False, _is_cache_length_accurate: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], batch_size: Int, num_partitions: Int, max_cache_valid_length: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], mask: mask_t, score_mod: score_mod_t)` --- ## mla_decoding_single_batch `mla_decoding_single_batch[q_type: DType, k_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, BM: UInt, BN: UInt, BK: UInt, WM: UInt, WN: UInt, depth: UInt, depth_v: UInt, num_heads: UInt, num_threads: UInt, num_pipeline_stages: UInt, group: UInt = UInt(1), use_score_mod: Bool = False, decoding_warp_split_k: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], exp_sum_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], qk_max_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], num_keys: UInt, num_partitions: UInt, max_cache_valid_length: UInt, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` Flash attention v2 algorithm.
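Several of the kernels above are labeled "Flash attention v2 algorithm." The heart of that approach is the online-softmax recurrence sketched below: scores are consumed tile by tile while a running max and exp-sum are maintained, rescaling prior state whenever the max grows. This is scalar, CPU-only, illustrative code under assumed toy values; the kernels apply it per score row across key tiles, fused with the P @ V accumulation.

```mojo
from math import exp

fn main():
    # Scores arrive one tile (here: one element) at a time.
    var scores = List[Float64](0.5, 2.0, 1.0, 3.0)
    var running_max: Float64 = -1e30
    var exp_sum: Float64 = 0.0
    for i in range(len(scores)):
        var new_max = max(running_max, scores[i])
        # Rescale the previously accumulated sum to the new max.
        exp_sum = exp_sum * exp(running_max - new_max) + exp(scores[i] - new_max)
        running_max = new_max
    # A single pass matches sum(exp(s - max(scores))), about 1.585 here.
    print(running_max, exp_sum)
```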
--- ## mla_prefill `mla_prefill[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, output_type: DType, softmax_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, config: MHAConfig, group: Int = 128, q_depth: Int = 192, cache_depth: Int = 576, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False, _ndbuffer_mha_operand: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, k_rope: k_rope_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], softmax_info_ptr: UnsafePointer[SIMD[softmax_type, 1]], prev_output_ptr: UnsafePointer[SIMD[output_type, 1]], prev_softmax_info_ptr: UnsafePointer[SIMD[softmax_type, 1]], scale: SIMD[float32, 1], batch_size: Int, seq_len_arg: Int, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], cache_offsets: OptionalReg[NDBuffer[uint32, 1, MutableAnyOrigin]], mask: mask_t, score_mod: score_mod_t)` --- ## mla_prefill_plan `mla_prefill_plan[cache_t: KVCacheT](buffer_row_offsets: NDBuffer[uint32, 2, origin, shape, strides], cache_offsets: NDBuffer[uint32, 2, origin, shape, strides], buffer_lengths: NDBuffer[int32, 1, origin, shape, strides], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], k_cache: cache_t, buffer_token_size: SIMD[uint32, 1], ctx: DeviceContext)` This calls a GPU kernel that plans how to process a batch of sequences with varying lengths using a fixed-size buffer. Each sequence in the batch has some existing cached tokens and new input tokens. The kernel divides the total tokens into chunks of buffer\_token\_size. For each chunk (iteration), it calculates:

1. Buffer offsets for each sequence in each chunk
2. Cache offsets for each sequence in each chunk
3. Total buffer lengths for each processing iteration

--- ## mla_prefill_plan_kernel `mla_prefill_plan_kernel[buffer_lengths_shape: DimList, cache_t: KVCacheT](buffer_row_offsets: NDBuffer[uint32, 2, MutableAnyOrigin], cache_offsets: NDBuffer[uint32, 2, MutableAnyOrigin], buffer_lengths: NDBuffer[int32, 1, MutableAnyOrigin, buffer_lengths_shape], input_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], k_cache: cache_t, buffer_token_size: SIMD[uint32, 1])` --- ## mla_prefill_single_batch `mla_prefill_single_batch[q_type: DType, k_t: MHAOperand, v_t: MHAOperand, k_rope_t: MHAOperand, output_type: DType, mask_t: MHAMask, score_mod_t: ScoreModTrait, *, config: MHAConfig, group: Int = 1, q_depth: Int = 192, cache_depth: Int = 576, use_score_mod: Bool = False, write_softmax_info: Bool = False, use_cascade_attention: Bool = False](q_ptr: UnsafePointer[SIMD[q_type, 1]], k: k_t, v: v_t, k_rope: k_rope_t, output_ptr: UnsafePointer[SIMD[output_type, 1]], softmax_info_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], prev_output_ptr: UnsafePointer[SIMD[output_type, 1]], prev_softmax_info_ptr: UnsafePointer[SIMD[get_accum_type[::DType,::DType](), 1]], scale: SIMD[float32, 1], seq_len: Int, max_seq_len: Int, start_pos: SIMD[uint32, 1], cache_start_pos: SIMD[uint32, 1], num_keys: Int, mask: mask_t, score_mod: score_mod_t, batch_idx: Int)` MLA for encoding where seqlen > 1. --- ## mma This module includes utilities for working with the warp-matrix-matrix-multiplication (wmma) instructions. ## Structs * [​`WGMMADescriptor`](/mojo/stdlib/gpu/mma/WGMMADescriptor): Descriptor for shared memory operands used in warp group matrix multiply operations. ## Functions * [​`ld_matrix`](/mojo/stdlib/gpu/mma/ld_matrix): Loads a matrix from shared memory into registers in a format suitable for tensor core operations.
* [​`mma`](/mojo/stdlib/gpu/mma/mma): Performs a warp-synchronized Tensor Core matrix-multiply and accumulate (MMA) operation. * [​`st_matrix`](/mojo/stdlib/gpu/mma/st_matrix): Performs warp-synchronized copy from registers to shared memory. * [​`wgmma_async`](/mojo/stdlib/gpu/mma/wgmma_async): Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. * [​`wgmma_commit_group_sync`](/mojo/stdlib/gpu/mma/wgmma_commit_group_sync): Commits pending warp group matrix multiply operations. * [​`wgmma_fence_aligned`](/mojo/stdlib/gpu/mma/wgmma_fence_aligned): Inserts a memory fence for warp group matrix multiply operations. * [​`wgmma_wait_group_sync`](/mojo/stdlib/gpu/mma/wgmma_wait_group_sync): Waits for all pending warp group matrix multiply operations to complete. --- ## mma `mma[block_size: Int = 1](mut d: SIMD[dtype, size], a: SIMD[dtype, size], b: SIMD[dtype, size], c: SIMD[dtype, size])` Performs a warp-synchronized Tensor Core matrix-multiply and accumulate (MMA) operation. This function executes a matrix multiply-accumulate operation using GPU Tensor Cores, synchronizing across the warp. It dispatches to architecture-specific implementations for NVIDIA and AMD GPUs. The operation performed is: d = (a \* b) + c. Supported configurations depend on the GPU architecture:

* NVIDIA: Various combinations of FP32, FP16, BF16, and FP8 formats
* AMD: Limited subset of FP32 and FP16 operations

Note:

* All threads in a warp must execute this operation together
* Input matrices must be properly loaded and formatted for Tensor Core operations
* Matrix dimensions and data types must match hardware requirements

**Parameters:** * ​block\_size (`Int`): The size of the block of the MMA operation (e.g., 4x4x4\_16B). Applies to AMD GPUs only. **Args:** * ​d (`SIMD[dtype, size]`): Output SIMD vector to store the result. * ​a (`SIMD[dtype, size]`): First input matrix as SIMD vector. * ​b (`SIMD[dtype, size]`): Second input matrix as SIMD vector. * ​c (`SIMD[dtype, size]`): Accumulator matrix as SIMD vector. --- ## mma `mma[kind: UMMAKind, //, cta_group: Int = 1, /, *, c_scale: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1)](a_desc: MMASmemDescriptor, b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix, 0 or 1. **Args:** * ​a\_desc (`MMASmemDescriptor`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix. * ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. `mma[kind: UMMAKind, //, cta_group: Int = 1, /, *, c_scale: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1)](a_desc: SIMD[uint32, 1], b_desc: MMASmemDescriptor, c_tmem: SIMD[uint32, 1], inst_desc: UMMAInsDescriptor[kind])` Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. **Parameters:** * ​kind (`UMMAKind`): Data type of the matrices. * ​cta\_group (`Int`): Number of ctas used by MMA. * ​c\_scale (`SIMD[uint32, 1]`): Scale factor for the C matrix, 0 or 1. **Args:** * ​a\_desc (`SIMD[uint32, 1]`): The descriptor for the A matrix. * ​b\_desc (`MMASmemDescriptor`): The descriptor for the B matrix.
* ​c\_tmem (`SIMD[uint32, 1]`): The address of the C matrix in the tensor memory. * ​inst\_desc (`UMMAInsDescriptor[kind]`): The descriptor for the MMA instruction. --- ## mma_arrive `mma_arrive[cta_group: Int = 1](mbar_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin])` Arrive at the mbar pointer for the MMA instruction. **Parameters:** * ​cta\_group (`Int`): Number of ctas used by MMA. **Args:** * ​mbar\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the mbar. --- ## mma_arrive_multicast `mma_arrive_multicast[cta_group: Int = 1](mbar_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], cta_mask: SIMD[uint16, 1])` Arrive at the mbar pointer for the MMA instruction for multiple ctas. **Parameters:** * ​cta\_group (`Int`): Number of ctas used by MMA. **Args:** * ​mbar\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the mbar. * ​cta\_mask (`SIMD[uint16, 1]`): Mask of ctas to signal. --- ## mma_sm100 This module includes utilities for working with the SM100 MMA instructions. ## Structs * [​`MMASmemDescriptor`](/mojo/stdlib/gpu/mma_sm100/MMASmemDescriptor): Descriptor for shared memory operands of tcgen05 mma instructions. * [​`UMMAInsDescriptor`](/mojo/stdlib/gpu/mma_sm100/UMMAInsDescriptor): Descriptor for UMMA instructions. * [​`UMMAKind`](/mojo/stdlib/gpu/mma_sm100/UMMAKind): Struct for UMMA instruction types. ## Functions * [​`mma`](/mojo/stdlib/gpu/mma_sm100/mma): Perform a matrix multiply-accumulate operation using the tcgen05.mma instruction. * [​`mma_arrive`](/mojo/stdlib/gpu/mma_sm100/mma_arrive): Arrive at the mbar pointer for the MMA instruction. * [​`mma_arrive_multicast`](/mojo/stdlib/gpu/mma_sm100/mma_arrive_multicast): Arrive at the mbar pointer for the MMA instruction for multiple ctas. --- ## mma_util Matrix multiply accumulate (MMA) utilities for GPU tensor cores. This module provides functions for loading matrix tiles from memory into registers and storing results back to memory when using tensor cores for matrix multiplication. It supports both NVIDIA and AMD GPUs with functions specialized for different data types (FP32, FP16, BF16). The key functions are:

* load\_matrix\_a: Loads tiles from the first input matrix A
* load\_matrix\_b: Loads tiles from the second input matrix B
* store\_matrix\_d: Stores result tiles to the output matrix D

Each function handles the specific memory access patterns required by the tensor core instructions on each GPU architecture. The tile sizes and data layouts match the hardware requirements documented in the NVIDIA PTX and AMD Matrix Cores references. ## Functions * [​`load_matrix_a`](/mojo/stdlib/gpu/mma_util/load_matrix_a): Loads a tile of matrix A from memory to registers for TF32 tensor core operations. * [​`load_matrix_a_amd`](/mojo/stdlib/gpu/mma_util/load_matrix_a_amd): Loads a tile of matrix A from memory to registers for AMD FP32 tensor core operations. * [​`load_matrix_b`](/mojo/stdlib/gpu/mma_util/load_matrix_b): Loads a tile of matrix B from memory to registers for TF32 tensor core operations. * [​`load_matrix_b_amd`](/mojo/stdlib/gpu/mma_util/load_matrix_b_amd): Loads a tile of matrix B from memory to registers for AMD FP32 tensor core operations. * [​`store_matrix_d`](/mojo/stdlib/gpu/mma_util/store_matrix_d): Stores matrix D tile from registers to memory after tensor core operation.
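To make the accumulate semantics concrete: per element, `mma` computes d = (a \* b) + c over hardware-defined tile fragments, which the `mma_util` load/store helpers feed. The plain SIMD fused multiply-add below is NOT the tensor core path; it is only a hedged, CPU-runnable illustration of that arithmetic.

```mojo
fn main():
    var a = SIMD[DType.float32, 4](1, 2, 3, 4)
    var b = SIMD[DType.float32, 4](5, 6, 7, 8)
    var c = SIMD[DType.float32, 4](10, 10, 10, 10)
    # Elementwise (a * b) + c: the same accumulate semantics per element.
    var d = a.fma(b, c)
    print(d)  # [15.0, 22.0, 31.0, 42.0]
```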
--- ## MMASmemDescriptor `@register_passable(trivial)` `struct MMASmemDescriptor` Descriptor for shared memory operands of tcgen05 mma instructions. This struct represents a descriptor that encodes information about shared memory layout and access patterns for warp group matrix multiply operations. The descriptor contains the following bit fields:

| Bit-field | Size | Description |
| --- | --- | --- |
| 0-13 | 14 | Base address in shared memory |
| 14-15 | 2 | Unused, 0 |
| 16-29 | 14 | LBO: leading dim byte offset |
| 30-31 | 2 | Unused, 0 |
| 32-45 | 14 | SBO: stride dim byte offset |
| 46-48 | 3 | Unused, 0 |
| 49-51 | 3 | Matrix base offset, 0 for canonical layouts |
| 52 | 1 | LBO mode, only matters for 48B K tile |
| 53-60 | 8 | Fixed, 0 |
| 61-63 | 3 | Swizzle mode |

The start address, LBO, and SBO ignore the 4 LSBs. ## Fields * ​desc (`SIMD[uint64, 1]`): The 64-bit descriptor encodes shared memory operand information. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(val: SIMD[uint64, 1]) -> Self` Initialize descriptor with raw 64-bit value. This constructor allows creating a descriptor directly from a 64-bit integer that already contains the properly formatted bit fields for the descriptor. The implicit attribute enables automatic conversion from `UInt64` to `MMASmemDescriptor`. **Args:** * ​val (`SIMD[uint64, 1]`): A 64-bit integer containing the complete descriptor bit layout. ### `__add__` `__add__(self, offset: Int) -> Self` Add offset to descriptor's base address. **Args:** * ​offset (`Int`): Byte offset to add to base address. **Returns:** New descriptor with updated base address. ### `__iadd__` `__iadd__(mut self, offset: Int)` Add offset to descriptor's base address in-place. **Args:** * ​offset (`Int`): Byte offset to add to base address. ### `create` `static create[stride_byte_offset: Int, leading_byte_offset: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](smem_ptr: UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]) -> Self` Create a descriptor for shared memory operand. **Parameters:** * ​stride\_byte\_offset (`Int`): Stride dimension offset in bytes. * ​leading\_byte\_offset (`Int`): Leading dimension stride in bytes. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern mode. **Args:** * ​smem\_ptr (`UnsafePointer[type, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to shared memory operand. **Returns:** Initialized descriptor for the shared memory operand. --- ## Mode `struct Mode` Defines a Benchmark Mode to distinguish between test runs and actual benchmarks. ## Fields * ​value (`Int`): Represents the mode type. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Benchmark` `alias Benchmark = Mode(0)` ### `Test` `alias Test = Mode(1)` ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks whether this is benchmark mode or test mode. **Args:** * ​other (`Self`): The mode to be compared against. **Returns:** Whether this is test mode or benchmark mode. --- ## Model

```c
#include "max/c/model.h"
```

## Functions ### `M_newCompileConfig()` > [M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*M\_newCompileConfig() Creates an object you can use to configure model compilation.
You need `M_CompileConfig` as an argument for several functions, including [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7), `M_setTorchInputSpecs()`, and [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). * **Returns:** A pointer to a new compilation configuration. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeCompileConfig()`](#model_8h_1abbf74b13adaf5bc8a0bb4d46c40688d9). This compilation configuration can only be used for a single compilation call. Any subsequent compilations must be passed a new `M_CompileConfig` (created by calling [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665) again). ### `M_cloneCompileConfig()` > [M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*M\_cloneCompileConfig([M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*other) Clones an object you can use to configure model compilation. * **Returns:** A pointer to a deep-cloned compilation configuration. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeCompileConfig()`](#model_8h_1abbf74b13adaf5bc8a0bb4d46c40688d9). This compilation configuration can only be used for a single compilation call. Any subsequent compilations must be passed a new `M_CompileConfig` (created by calling [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665) or [`M_cloneCompileConfig()`](#model_8h_1a964d9da1706841788fc492d527c116dc) again). ### `M_setModelPath()` > void M\_setModelPath([M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*compileConfig, const char \*path) Sets the path to a model. You must call this before you call [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). Otherwise, [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750) returns an error in `status`. Note: PyTorch models must be in TorchScript format. * **Parameters:** * **compileConfig** – The compilation configuration for your model, from [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665). * **path** – The path to your model. The model does not need to exist on the filesystem at this point. This follows the same semantics and expectations as `std::filesystem::path`. ### `M_newModelSource()` > [M\_ModelSource](types.md#_CPPv413M_ModelSource) \*M\_newModelSource(void \*source, [M\_FrameworkFormat](types.md#_CPPv417M_FrameworkFormat) format) Creates an opaque TorchScript model representation. * **Parameters:** * **source** – A pointer to the model representation. * **format** – The framework format matching the model representation. * **Returns:** A pointer to the opaque model representation. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeModelSource()`](#model_8h_1a1c4b2248fdfed4c9f0dbabe846e6a990). ### `M_setModelSource()` > void M\_setModelSource([M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*compileConfig, [M\_ModelSource](types.md#_CPPv413M_ModelSource) \*modelSource) Sets the opaque representation of the model for compilation. You must call this or [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7) before you call [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). Otherwise, [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750) returns an error in `status`.
* **Parameters:** * **compileConfig** – The compilation configuration for your model, from [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665). * **modelSource** – The opaque representation of your model. ### `M_compileModel()` > [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*M\_compileModel(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, [M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*\*compileConfig, [M\_Status](types.md#_CPPv48M_Status) \*status) Compiles a model. This immediately returns an `M_AsyncCompiledModel`, with compilation happening asynchronously. If you need to block to await compilation, you can then call [`M_waitForCompilation()`](#model_8h_1a8040a6488596f863c205d769d92ad013). You must call [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7) before you call this. For example:

```c
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, modelPath);

M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
```

When using a TorchScript model, you must also specify the input shapes via `M_setTorchInputSpecs()` before you compile it. The `M_AsyncCompiledModel` returned here is not ready for inference yet. You need to then initialize the model with [`M_initModel()`](#model_8h_1a2dcb9570ae117602579182d8faed494a). * **Parameters:** * **context** – The runtime context, from [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175). * **compileConfig** – Address of compilation configuration for your model created with [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665), and with the model set via [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7). Ownership of the configuration is handed over to the API. * **status** – The status used to report errors in the case of failures during model compilation. * **Returns:** A pointer to an `M_AsyncCompiledModel`. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeCompiledModel()`](#model_8h_1a5b6846eb4d47d445eb65c305b1c81b1c). If the config is invalid, it returns a `NULL` pointer. If the model compilation fails, the pointer is `NULL` and the `status` parameter contains an error message. `compileConfig` will be reset to `NULL` after this call irrespective of status and cannot be reused, and any subsequent calls must take a new `M_CompileConfig`. ### `M_waitForCompilation()` > void M\_waitForCompilation([M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*compiledModel, [M\_Status](types.md#_CPPv48M_Status) \*status) Blocks execution until the model is compiled. This waits for the async compiled model to be complete after calling [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). When this function returns, the model is resolved to either a compiled model or an error. * **Parameters:** * **compiledModel** – The model received from [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). * **status** – The status used to report errors in the case of failures.
### `M_compileModelSync()` > [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*M\_compileModelSync(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, [M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*\*compileConfig, [M\_Status](types.md#_CPPv48M_Status) \*status) Synchronously compiles a model. Unlike [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750), this blocks until model compilation is complete. It returns an `M_AsyncCompiledModel` without needing to call [`M_waitForCompilation()`](#model_8h_1a8040a6488596f863c205d769d92ad013). All other setup and usage is identical to [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). * **Parameters:** * **context** – The runtime context, from [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175). * **compileConfig** – Address of compilation configuration for your model created with [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665), and with the model set via [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7). Ownership of the configuration is handed over to the API. * **status** – The status used to report errors in the case of failures during model compilation. * **Returns:** A pointer to an `M_AsyncCompiledModel`. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeCompiledModel()`](#model_8h_1a5b6846eb4d47d445eb65c305b1c81b1c). If the config is invalid, it returns a `NULL` pointer. If the model compilation fails, the pointer is `NULL` and the `status` parameter contains an error message. `compileConfig` will be reset to `NULL` after this call irrespective of status and cannot be reused, and any subsequent calls must take a new `M_CompileConfig`. ### `M_initModel()` > [M\_AsyncModel](types.md#_CPPv412M_AsyncModel) \*M\_initModel(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*compiledModel, const [M\_WeightsRegistry](types.md#_CPPv417M_WeightsRegistry) \*weightsRegistry, [M\_Status](types.md#_CPPv48M_Status) \*status) Sets up a model for execution. You can call this immediately after [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750)—you don’t need to wait for the async compilation. This function also returns immediately with model initialization happening asynchronously. For example:

```c
M_AsyncModel *model = M_initModel(
    context, compiledModel, weightsRegistry, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
```

If you want to block until `M_AsyncModel` is initialized, you can call [`M_waitForModel()`](#model_8h_1a852bec3f80cebb5c06911091d5cab349), but that’s not necessary and you can immediately call [`M_executeModelSync()`](#model_8h_1a2ced4683834a77d0b943a6bc72d846d5). * **Parameters:** * **context** – The runtime context, from [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175). * **compiledModel** – The compiled model, from [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). * **weightsRegistry** – A mapping from weights’ names to their data. The weights registry is used to update weights or otherwise pass weights to the model init block at runtime, without recompiling the model graph. If the model doesn’t use the weights registry, it is safe to pass `NULL`. * **status** – The status used to report errors in the case of failures.
The status contains an error only if the given context or compiled model is invalid. Other errors will not surface until the next synchronization point. * **Returns:** A pointer to an `M_AsyncModel` that holds an async value to a compiled model. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeModel()`](#model_8h_1a4094fa8e414f8b6a6563474f8840d33c). If model initialization fails, the `status` parameter contains an error message. ### `M_getInputNames()` > [M\_TensorNameArray](types.md#_CPPv417M_TensorNameArray) \*M\_getInputNames(const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets all input tensor names. * **Parameters:** * **model** – The compiled model. * **status** – The status used to report errors in the case of failures. The status contains an error only if the given model is invalid. * **Returns:** An array of input tensor names or a `NULL` pointer if the model is invalid. If `NULL`, the `status` parameter contains an error message. Callers are responsible for freeing the returned array by calling [`M_freeTensorNameArray()`](tensor.md#tensor_8h_1a7fa5d2aff7f89143ae1905fc29b5b112). ### `M_getOutputNames()` > [M\_TensorNameArray](types.md#_CPPv417M_TensorNameArray) \*M\_getOutputNames(const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets all output tensor names. * **Parameters:** * **model** – The compiled model. * **status** – The status used to report errors in the case of failures. The status contains an error only if the given model is invalid. * **Returns:** An array of output tensor names or a `NULL` pointer if the model is invalid. If `NULL`, the `status` parameter contains an error message. Callers are responsible for freeing the returned array by calling [`M_freeTensorNameArray()`](tensor.md#tensor_8h_1a7fa5d2aff7f89143ae1905fc29b5b112). ### `M_getTensorNameAt()` > const char \*M\_getTensorNameAt(const [M\_TensorNameArray](types.md#_CPPv417M_TensorNameArray) \*tensorNameArray, size\_t index) Gets the tensor name in `tensorNameArray` at `index`. * **Parameters:** * **tensorNameArray** – The tensor name array. * **index** – The index of the tensor name to get. * **Returns:** A pointer to the tensor name at `index` or a `NULL` pointer if the index is out of bounds, or if `tensorNameArray` is `NULL`. The returned string is owned by `tensorNameArray` and is null-terminated. ### `M_getModelInputSpecByName()` > [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*M\_getModelInputSpecByName(const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model, const char \*tensorName, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets the specifications for an input tensor by the tensor’s name. * **Parameters:** * **model** – The compiled model. * **tensorName** – The name of the input tensor. * **status** – The status used to report errors in the case of failures. The status contains an error only if the given model or `tensorName` is invalid. * **Returns:** A pointer to an `M_TensorSpec`, or a `NULL` pointer if the model or tensor name is invalid. If `NULL`, the `status` parameter contains an error message.
### `M_getModelOutputSpecByName()` > [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*M\_getModelOutputSpecByName(const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model, const char \*tensorName, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets the specifications for an output tensor by the tensor’s name. * **Parameters:** * **model** – The compiled model. * **tensorName** – The name of the output tensor. * **status** – The status used to report errors in the case of failures. The status contains an error only if the given model or `tensorName` is invalid. * **Returns:** A pointer to an `M_TensorSpec`, or a `NULL` pointer if the model or tensor name is invalid. If `NULL`, the `status` parameter contains an error message. ### `M_waitForModel()` > void M\_waitForModel([M\_AsyncModel](types.md#_CPPv412M_AsyncModel) \*model, [M\_Status](types.md#_CPPv48M_Status) \*status) Blocks execution until the model is initialized. This waits for the model setup to finish in [`M_initModel()`](#model_8h_1a2dcb9570ae117602579182d8faed494a). * **Parameters:** * **model** – The model. * **status** – The status used to report errors in the case of failures. ### `M_executeModelSync()` > [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*M\_executeModelSync(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, [M\_AsyncModel](types.md#_CPPv412M_AsyncModel) \*initializedModel, [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*inputs, [M\_Status](types.md#_CPPv48M_Status) \*status) Executes a model synchronously. The inputs and outputs are `M_AsyncTensorMap` objects to allow chaining of inference. This operation is blocking and waits until the output results are ready. * **Parameters:** * **context** – The runtime context. * **initializedModel** – The model to execute, from [`M_initModel()`](#model_8h_1a2dcb9570ae117602579182d8faed494a). Although that function is async, you can pass the `M_AsyncModel` here immediately. * **inputs** – The tensor inputs. * **status** – The status used to report errors in the case of failures. This includes failures encountered while running the model; there is no need for an explicit synchronization point. * **Returns:** A pointer to an `M_AsyncTensorMap` that holds the output tensors. These tensors are in a resolved state. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeAsyncTensorMap()`](tensor.md#tensor_8h_1a0ac9628dcba39c9977b7f7ff95d8781e). In the case that executing the model fails, the `status` parameter contains an error message. ### `M_getNumModelInputs()` > size\_t M\_getNumModelInputs(const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets the number of inputs for the model. If the model is not yet resolved/ready, this function blocks execution. You should call [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750) before calling this. * **Parameters:** * **model** – The compiled model. * **status** – The status used to report errors in the case of failures. * **Returns:** The number of inputs for the model, or `0` if there is an error in getting the model metadata. If `0`, the `status` parameter contains an error message. ### `M_getNumModelOutputs()` > size\_t M\_getNumModelOutputs(const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets the number of outputs for the model.
If the model is not yet resolved/ready, this function blocks execution. You should call [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750) before calling this. * **Parameters:** * **model** – The compiled model. * **status** – The status used to report errors in the case of failures. * **Returns:** The number of outputs for the model, or `0` if there is an error in getting the model metadata. If `0`, the `status` parameter contains an error message. ### `M_validateInputTensorSpec()` > void M\_validateInputTensorSpec(const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model, [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensors, [M\_Status](types.md#_CPPv48M_Status) \*status) Validates input tensor specs for compatibility with the compiled model. The status message shows which validation check failed for the input. * **Parameters:** * **model** – The compiled model. * **tensors** – The tensors whose specs need to be validated. * **status** – The status used to report errors in the case of failures. * **Returns:** True if `tensors` has valid specs for the `model`. ### `M_freeModel()` > void M\_freeModel([M\_AsyncModel](types.md#_CPPv412M_AsyncModel) \*model) Deallocates the memory for the model. No-op if `model` is `NULL`. * **Parameters:** **model** – The model to deallocate. ### `M_freeCompiledModel()` > void M\_freeCompiledModel([M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model) Deallocates the memory for the compiled model. No-op if `model` is `NULL`. * **Parameters:** **model** – The compiled model to deallocate. ### `M_freeCompileConfig()` > void M\_freeCompileConfig([M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*config) Deallocates the memory for the compile config. No-op if `config` is `NULL`. * **Parameters:** **config** – The compilation configuration to deallocate. ### `M_freeModelSource()` > void M\_freeModelSource([M\_ModelSource](types.md#_CPPv413M_ModelSource) \*modelSource) Deallocates the memory for the model source. No-op if `modelSource` is `NULL`. * **Parameters:** **modelSource** – The model source to deallocate. ### `M_exportCompiledModel()` > void M\_exportCompiledModel([M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model, const char \*path, [M\_Status](types.md#_CPPv48M_Status) \*status) Exports a compiled model as a MEF to a given path. * **Parameters:** * **model** – The model instance to export. * **path** – The path of the MEF file to export. * **status** – The status used to report errors in the case of failures. --- ## Model support import TutorialStack from '@site/src/components/TutorialStack'; MAX allows you to pick the perfect GenAI model for your project from Hugging Face. You just provide the name of the model you want, and MAX takes care of the rest. It builds the model as a high-performance graph and starts a serving endpoint that runs the model on either a CPU or a GPU. This page explains how this works out of the box with models from Hugging Face, and introduces how you can customize an existing model or create your own. :::note MAX model repo If you just want to browse some models, check out the [MAX model repository](https://builds.modular.com/?category=models&type=MAX+Model). ::: ## Model configs To understand how MAX accelerates hundreds of GenAI models from Hugging Face, you should first know a little about Hugging Face model configurations. Nowadays, the definitive place to find AI models is [Hugging Face Model Hub](https://huggingface.co/models).
Although models on Hugging Face might be built and trained with different machine learning frameworks, they all include a `config.json` file, which is like a model blueprint. This file contains all the information you need to understand the model architecture and its configuration, such as the number of layers used, the embedding size, and other hyperparameters. By reading the model configuration, we can reconstruct any model from Hugging Face as a MAX model. ## MAX models {#max-graph} A MAX model is a high-performance inferencing model built with our [MAX Python API](/max/api/python/). It's a unique model format that allows the MAX graph compiler to optimize the model for inference on a wide range of hardware and deliver state-of-the-art performance you normally see only from model-specific inference libraries written in C or C++. You can build these models yourself with our Python API, but you don't have to. All you have to do is specify the GenAI model you want from Hugging Face (such as [`meta-llama/Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)), and MAX will programmatically reconstruct it as a MAX model. This works because we have already built a library of [base model architectures](https://github.com/modular/modular/tree/main/max/pipelines/architectures) with the MAX Python API. When you ask MAX to start an inference server with a Hugging Face model, MAX pulls the corresponding pre-built architecture from our library and makes the appropriate changes based on the configuration from Hugging Face. This all happens automatically when you start a serving endpoint with the [`max`](/max/max-cli) CLI or with the [MAX container](/max/container). For example, here's how to start an endpoint using Meta's Llama 3.2 Instruct model as a MAX model: ```sh max serve --model-path=meta-llama/Llama-3.2-1B-Instruct ``` :::caution This model requires a GPU The command above will fail if your system doesn't have a [compatible GPU](/max/faq#gpu-requirements). However, you can make it work if you instead [load quantized weights](#customize-a-model) as shown below. ::: When you run the `max serve` command, MAX pulls the model configuration and weights from Hugging Face and builds it as a MAX model. Then it starts up an endpoint to handle inference requests that you send using [our REST API](/max/api/serve). ### Customize a model If you want to load a different set of weights for a given model, you can pass them in GGUF or Safetensors format using the `--weight-path` argument. This accepts either a local path or a Hugging Face repo with the weights. For example, here's how to run `Llama-3.2-1B-Instruct` on a CPU with quantized weights ([from bartowski](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF)): ```sh max serve --model-path=meta-llama/Llama-3.2-1B-Instruct \ --weight-path=bartowski/Llama-3.2-1B-Instruct-GGUF/Llama-3.2-1B-Instruct-Q6_K.gguf ``` When using GGUF models, quantization encoding formats are automatically detected. When using the `max` command with a model from a Hugging Face repository, explicitly providing a quantization encoding is optional. ```sh max serve --model-path="modularai/Llama-3.1-8B-Instruct-GGUF" \ --quantization-encoding=q4_k ``` If no quantization encoding is specified, MAX Serve automatically detects and uses the first encoding option from the repository. If a quantization encoding is provided, it must align with the available encoding options in the repository. 
If the repository contains multiple quantization formats, be sure to specify which encoding type you want to use. For help creating your own weights in GGUF format, see the tutorial to [Bring your own fine-tuned model](/max/tutorials/max-pipeline-bring-your-own-model). ### Build your own model Although our model-building APIs are still under heavy development while we implement the most popular architectures, you can also build your own models with the MAX APIs today. To build your own inference model with MAX, the process generally looks like this: 1. Instantiate a [`Graph`](/max/api/python/graph/Graph) by specifying the input shape as a [`TensorType`](/max/api/python/graph/type#max.graph.type.TensorType). 2. Build the graph by chaining [`ops`](/max/api/python/graph/ops/) functions. Each function takes and returns a [`Value`](/max/api/python/graph/Value) object. 3. Add the final `Value` to the graph using the [`output()`](/max/api/python/graph/Graph#max.graph.Graph.output) method. For more information, see our tutorial to [get started with MAX Graph in Python](/max/tutorials/get-started-with-max-graph-in-python). ## PyTorch eager mode As you might suspect, MAX doesn't have a pre-built architecture to match *every* model on Hugging Face. But that's fine, because MAX also supports eager-mode execution for all other PyTorch LLMs (using the Hugging Face Transformers API). If MAX doesn't have a pre-built model architecture for the Hugging Face model you pass in, it falls back to running the model with Hugging Face Transformers. That means the model won't be compiled and accelerated with MAX, but you'll still get an endpoint with [our serving API](/max/api/serve) that's OpenAI-compatible. However, this fallback is increasingly unlikely for popular GenAI models, because most of them are based on a handful of architectures that we've implemented as MAX models. For example, there are thousands of models based on the `LlamaForCausalLM` architecture. You can see the most popular models that work with MAX today (either as MAX models or with eager mode) in [the MAX model repository](https://builds.modular.com/?category=models&type=MAX+Model). ## Get started export const tutorials = [ 'max-serve-local-to-cloud', 'deploy-max-serve-on-kubernetes', ]; --- ## modf `modf[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> Tuple[SIMD[dtype, width], SIMD[dtype, width]]` Computes the integral and fractional parts of the value. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. * width (`Int`): The width of the input and output SIMD vector. **Args:** * x (`SIMD[dtype, width]`): The input value. **Returns:** A tuple containing the integral and fractional parts of the value. --- ## Modular Documentation import Homepage, { GetStartedButton } from '@site/src/components/Homepage'; import CodeNote from '@site/src/components/Homepage/CodeNote'; import { ArrowTransfer } from '@site/src/shared/Svgs/ArrowTransfer'; import { ArrowCloud } from '@site/src/shared/Svgs/ArrowCloud'; import { DesktopCode } from '@site/src/shared/Svgs/DesktopCode'; import { AIChip } from '@site/src/shared/Svgs/AIChip'; import { RecipesIcon } from '@site/src/shared/Svgs/RecipesIcon'; import { OpenBook } from '@site/src/shared/Svgs/OpenBook'; import { PuzzleIcon } from '@site/src/shared/Svgs/PuzzleIcon'; The Modular Platform accelerates AI inference and abstracts hardware complexity.
Using our Docker container, you can deploy a GenAI model from Hugging Face with an OpenAI-compatible endpoint on a wide range of hardware. And if you need to customize the model or tune a GPU kernel, Modular provides a depth of model extensibility and GPU programmability that you won't find anywhere else. ```python title="python" from openai import OpenAI client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY") completion = client.chat.completions.create( model="modularai/Llama-3.1-8B-Instruct-GGUF", messages=[ {"role": "user", "content": "Who won the world series in 2020?"} ], ) print(completion.choices[0].message.content) ``` export const sectionCards = [ { title: 'Serving', description: 'Modular’s serving library is compatible with OpenAI APIs, so you can own your endpoint with minimal client-side code changes.', to: '/max/serve/', }, { title: 'Deploying', description: 'You can quickly deploy your GenAI model to the cloud using our ready-to-deploy Docker container.', to: '/max/deploy/', }, { title: 'Developing', description: 'The Modular platform provides full extensibility, so you can write custom ops, hardware-agnostic GPU kernels, and more.', to: '/max/develop/', }, { title: 'Programming with Mojo🔥', description: 'Mojo is a Python-style programming language that allows you to write code for both CPUs and GPUs.', to: '/mojo/manual/', }, ]; export const learningToolCards = [ { title: 'Tutorials', description: 'Step-by-step instructions to develop and deploy with the Modular platform.', to: '/max/tutorials/', }, { title: 'Recipes', description: 'Turn-key applications that use GenAI models with the Modular platform.', href: 'https://builds.modular.com/?category=recipes', }, { title: 'GPU Puzzles', description: 'A hands-on guide to mastering GPU programming with Mojo.', href: 'https://builds.modular.com/puzzles', }, ]; --- ## Modules and packages Mojo provides a packaging system that allows you to organize and compile code libraries into importable files. This page introduces the necessary concepts about how to organize your code into modules and packages (which is a lot like Python), and shows you how to create a packaged binary with the [`mojo package`](/mojo/cli/package) command. ## Mojo modules To understand Mojo packages, you first need to understand Mojo modules. A Mojo module is a single Mojo source file that includes code suitable for use by other files that import it. For example, you can create a module to define a struct such as this one: ```mojo title="mymodule.mojo" struct MyPair: var first: Int var second: Int fn __init__(out self, first: Int, second: Int): self.first = first self.second = second fn dump(self): print(self.first, self.second) ``` Notice that this code has no `main()` function, so you can't execute `mymodule.mojo`. However, you can import this into another file with a `main()` function and use it there. For example, here's how you can import `MyPair` into a file named `main.mojo` that's in the same directory as `mymodule.mojo`: ```mojo title="main.mojo" from mymodule import MyPair fn main(): var mine = MyPair(2, 4) mine.dump() ``` Alternatively, you can import the whole module and then access its members through the module name.
For example: ```mojo title="main.mojo" import mymodule fn main(): var mine = mymodule.MyPair(2, 4) mine.dump() ``` You can also create an alias for an imported member with `as`, like this: ```mojo title="main.mojo" import mymodule as my fn main(): var mine = my.MyPair(2, 4) mine.dump() ``` In all of these examples, the import works only when `mymodule.mojo` is in the same directory as `main.mojo`. Currently, you can't import `.mojo` files as modules if they reside in other directories. That is, unless you treat the directory as a Mojo package, as described in the next section. :::note A Mojo module may include a `main()` function and may also be executable, but that's generally not the practice and modules typically include APIs to be imported and used in other Mojo programs. ::: ## Mojo packages A Mojo package is just a collection of Mojo modules in a directory that includes an `__init__.mojo` file. By organizing modules together in a directory, you can then import all the modules together or individually. You can also compile the package into a `.mojopkg` or `.📦` file that's easier to share and still compatible with other system architectures. You can import a package and its modules either directly from source files or from a compiled `.mojopkg`/`.📦` file. It makes no real difference to Mojo which way you import a package. When importing from source files, the directory name works as the package name, whereas when importing from a compiled package, the filename is the package name (which you specify with the [`mojo package`](/mojo/cli/package) command—it can differ from the directory name). For example, consider a project with these files: ```ini main.mojo mypackage/ __init__.mojo mymodule.mojo ``` `mymodule.mojo` is the same code from examples above (with the `MyPair` struct) and `__init__.mojo` is empty. In this case, the `main.mojo` file can now import `MyPair` through the package name like this: ```mojo title="main.mojo" from mypackage.mymodule import MyPair fn main(): var mine = MyPair(2, 4) mine.dump() ``` Notice that the `__init__.mojo` is crucial here. If you delete it, then Mojo doesn't recognize the directory as a package and it cannot import `mymodule`. Then, let's say you don't want the `mypackage` source code in the same location as `main.mojo`. So, you can compile it into a package file like this: ```sh mojo package mypackage -o mypack.mojopkg ``` :::note A `.mojopkg` file contains non-elaborated code, so you can share it across systems. The code becomes an architecture-specific executable only after it's imported into a Mojo program that's then compiled with `mojo build`. ::: Now, you can move the `mypackage` source somewhere else, and the project files now look like this: ```ini main.mojo mypack.mojopkg ``` Because we named the package file differently from the directory, we need to fix the import statement, and then it all works the same: ```mojo title="main.mojo" from mypack.mymodule import MyPair ``` :::note If you want to rename your package, you cannot simply edit the `.mojopkg` or `.📦` filename, because the package name is encoded in the file. You must instead run `mojo package` again to specify a new name. ::: ### The `__init__` file As mentioned above, the `__init__.mojo` file is required to indicate that a directory should be treated as a Mojo package, and it can be empty. Currently, top-level code is not supported in `.mojo` files, so unlike Python, you can't write code in `__init__.mojo` that executes upon import.
You can, however, add structs and functions, which you can then import from the package name. Instead of adding APIs directly in the `__init__.mojo` file, you can also import module members there, which has the same effect of making your APIs accessible from the package name, instead of requiring the `<module_name>` notation. For example, again let's say you have these files: ```ini main.mojo mypackage/ __init__.mojo mymodule.mojo ``` Let's now add the following line in `__init__.mojo`: ```mojo title="__init__.mojo" from .mymodule import MyPair ``` That's all that's in there. Now, we can simplify the import statement in `main.mojo` like this: ```mojo title="main.mojo" from mypackage import MyPair ``` This feature explains why some members in the Mojo standard library can be imported from their package name, while others require the `<module_name>` notation. For example, the [`functional`](/mojo/stdlib/algorithm/functional/) module resides in the `algorithm` package, so you can import members of that module (such as the `map()` function) like this: ```mojo from algorithm.functional import map ``` However, the `algorithm/__init__.mojo` file also includes these lines: ```mojo title="algorithm/__init__.mojo" from .functional import * from .reduction import * ``` So you can actually import anything from `functional` or `reduction` simply by naming the package. That is, you can drop the `functional` name from the import statement, and it also works: ```mojo from algorithm import map ``` :::note Which modules in the standard library are imported to the package scope varies, and is subject to change. Refer to the [documentation for each module](/mojo/lib) to see how you can import its members. ::: --- ## moe ## Functions * [`moe_create_indices`](./moe_create_indices): * [`moe_create_indices_kernel`](./moe_create_indices_kernel): --- ## moe_create_indices `moe_create_indices[input_type: DType, //, target: StringSlice[StaticConstantOrigin]](token_expert_order: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_start_indices: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], restore_token_order: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_ids: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], expert_usage_stats: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], topk_ids: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], context: DeviceContextPtr)` --- ## moe_create_indices_kernel `moe_create_indices_kernel[input_type: DType, num_threads: Int, token_expert_order_layout: Layout, expert_start_indices_layout: Layout, restore_token_order_layout: Layout, expert_ids_layout: Layout, expert_usage_stats_layout: Layout,
indices_padded_layout: Layout, padded_input_layout: Layout, topk_ids_layout: Layout](token_expert_order: LayoutTensor[uint32, token_expert_order_layout, MutableAnyOrigin], expert_start_indices: LayoutTensor[uint32, expert_start_indices_layout, MutableAnyOrigin], restore_token_order: LayoutTensor[uint32, restore_token_order_layout, MutableAnyOrigin], expert_ids: LayoutTensor[uint32, expert_ids_layout, MutableAnyOrigin], expert_usage_stats: LayoutTensor[uint32, expert_usage_stats_layout, MutableAnyOrigin], indices_padded: LayoutTensor[uint32, indices_padded_layout, MutableAnyOrigin], topk_ids_padded: LayoutTensor[input_type, padded_input_layout, MutableAnyOrigin], topk_ids: LayoutTensor[input_type, topk_ids_layout, MutableAnyOrigin])` --- ## mojo The Mojo🔥 command line interface. ## Synopsis ```
mojo <command>
mojo [run-options] <path>
mojo [options]
mojo
``` ## Description The `mojo` CLI provides all the tools you need for Mojo development, such as commands to run, compile, and package Mojo code. All available commands are listed below, and you can learn more about each one by adding the `--help` option to the command (for example, `mojo package --help`). However, you may omit the `run` and `repl` commands. That is, you can run a Mojo file by simply passing the filename to `mojo`: ``` mojo hello.mojo ``` And you can start a REPL session by running `mojo` with no commands. To update Mojo to the latest version, use the [`magic` tool](/mojo/manual/get-started#update-mojo): ``` magic update ``` You can check your current version with `mojo --version`. For information about Mojo updates, see the [Mojo changelog](/mojo/changelog.html). ## Commands [`run`](run.md) — Builds and executes a Mojo file. [`build`](build.md) — Builds an executable from a Mojo file. [`repl`](repl.md) — Launches the Mojo REPL. [`debug`](debug.md) — Launches the Mojo debugger using the command-line interface or an external editor. [`package`](package.md) — Compiles a Mojo package. [`format`](format.md) — Formats Mojo source files. [`doc`](doc.md) — Compiles docstrings from a Mojo file. [`demangle`](demangle.md) — Demangles the given name. [`test`](test.md) — Executes unit, integration, and documentation tests. ## Options ### Diagnostic options #### `--version`, `-v` Prints the Mojo version and exits. ### Common options #### `--help`, `-h` Displays help information. --- ## mojo build Builds an executable from a Mojo file. ## Synopsis ``` mojo build [options] <path> ``` ## Description Compiles the Mojo file at the given path into an executable. By default, the executable is saved to the current directory and named the same as the input file, but without a file extension. Beware that any Python libraries used in your Mojo project are not included in the executable binary, so they must be provided by the environment where you run the executable. ## Options ### Output options #### `-o <path>` Sets the path and filename for the executable output. By default, it outputs the executable to the same location as the Mojo file, with the same name and no extension. #### `--emit <type>` The type of output file to generate. * `exe` (default): emit an executable binary file. * `shared-lib`: emit a shared (dynamic) library. * `object`: (EXPERIMENTAL) emit a single object file. * `llvm`: emit LLVM IR. * `asm`: emit target assembly. ### Compilation options #### `--optimization-level <level>`, `-O`, `--no-optimization (LEVEL=0)` Sets the level of optimization to use at compilation. The value must be a number between 0 and 3. The default is 3.
#### `-I <path>` Appends the given path to the list of directories to search for imported Mojo files. #### `-D <key=value>` Defines a named value that can be used from within the Mojo source file being executed. For example, `-D foo=42` defines a name `foo` that, when queried with the `sys.param_env` module from within the Mojo program, would yield the compile-time value `42`. #### `--debug-level <level>`, `-g (LEVEL=full)` Sets the level of debug info to use at compilation. The value must be one of: `none` (the default value), `line-tables`, or `full`. Please note that there are issues when generating debug info for some Mojo programs that have yet to be addressed. #### `--num-threads <num>`, `-j` Sets the maximum number of threads to use for compilation. The default is 0 (use all available threads). ### Target options #### `--target-triple <triple>` Sets the compilation target triple. Defaults to the host target. #### `--target-cpu <cpu>` Sets the compilation target CPU. Defaults to the host CPU. #### `--target-features <features>` Sets the compilation target CPU features. Defaults to the host features. #### `--march <arch>` Sets the architecture for which to generate code. #### `--mcpu <cpu>` Sets the CPU for which to generate code. #### `--mtune <cpu>` Sets the CPU for which to tune code. ### Compilation diagnostic options Controls how the Mojo compiler outputs diagnostics related to compiling and running Mojo source code. #### `--diagnose-missing-doc-strings` Emits diagnostics for missing or partial doc strings. #### `--validate-doc-strings` Emits errors for invalid doc strings instead of warnings. #### `--max-notes-per-diagnostic <max-notes>` When the Mojo compiler emits diagnostics, it sometimes also prints notes with additional information. This option sets an upper threshold on the number of notes that can be printed with a diagnostic. If not specified, the default maximum is 10. #### `--disable-builtins` Do not use builtins when creating a package. #### `--disable-warnings` Do not print warning messages. ### Experimental compilation options #### `--sanitize <check>` Turns on runtime checks. The following values are supported: `address` (detects memory issues), and `thread` (detects multi-threading issues). #### `--shared-libasan` Dynamically link the address sanitizer runtime. Requires address sanitization turned on with the `--sanitize` option. #### `--debug-info-language <language>` Sets the language to emit as part of the debug info. The supported languages are: `Mojo`, and `C`. `C` is the default, and is useful to enable rudimentary debugging and binary introspection in tools that don't understand Mojo. ### Common options #### `--diagnostic-format <format>` The format in which diagnostics and error messages are printed. Must be one of "text" or "json" ("text" is the default). #### `--help`, `-h` Displays help information. --- ## mojo debug Launches the Mojo debugger using the command-line interface or an external editor. ## Synopsis ``` mojo debug [debug-options] ``` ## Description This command, which uses the LLDB debugger or cuda-gdb underneath, offers four basic debug session modes: * Build and debug a Mojo file. ``` mojo debug [options] <path> [runtime args] ``` Builds the Mojo file at the given path and launches it under the debugger. Options, which come before the Mojo file, can include any compilation options expected by `mojo run`, as well as regular debugging commands. Runtime args, which come after the Mojo file, are passed directly to the debuggee upon launch. By default, this mode uses `-O0` and `--debug-level=full` as compilation options. * Debug a precompiled program.
``` mojo debug [options] <path> [runtime args] ``` Launches the program at the given path in the debugger. Options, which come before the program path, cannot include compilation commands. Runtime args, which come after the program path, are passed directly to the debuggee upon launch. * Attach to a running process. ``` mojo debug [options] [--pid <pid> | --process-name <name>] ``` Attaches to the process specified by PID or name, which can be the full path of the process' executable. Options other than the process identifier cannot include compilation options. * Start the debugger command-line interface. ``` mojo debug [options] ``` Launches the debugger CLI with support for debugging Mojo programs. This command only supports LLDB or cuda-gdb options via the `--X` option. You can also select one of two interfaces for the debug session: * CLI: By default, all debug session modes are launched using the regular debugger command-line interface. * VS Code Debug Server: If you add the `--vscode` option, the debug session is launched in VS Code via the Mojo extension. VS Code must be running and the Mojo extension must be enabled. Besides that, the environment variables and the current working directory of this invocation are preserved when launching programs in the debugger on VS Code. Finally, it is worth mentioning that this debugger can debug programs written in other standard native languages like Rust, C, and C++, as it is based on LLDB or cuda-gdb. Debugger capabilities: * LLDB: this is the default debugger and has great support for CPU Mojo code, but has no support at all for Mojo GPU code. * cuda-gdb: this is invoked via the `--cuda-gdb` option and has minimal support for CPU Mojo code, but it supports GPU Mojo code. ## Options ### Attach options #### `--pid <pid>` Tells the debugger to attach to the process with the given PID. #### `--process-name <name>` Tells the debugger to attach to the process with the given name or path. ### cuda-gdb options #### `--cuda-gdb` Uses cuda-gdb instead of LLDB for debugging. In this mode, it's possible to step into GPU code, but the CPU debugging experience is degraded. #### `--cuda-gdb-path <path>` Uses the given CUDA\_GDB\_PATH instead of looking for cuda-gdb in the PATH environment variable. #### `--break-on-launch` Sets the breakOnLaunch option for cuda-gdb. This makes the debugger break on the first instruction of every launched kernel. ### Compilation options #### `--optimization-level <level>`, `-O`, `--no-optimization (LEVEL=0)` Sets the level of optimization to use at compilation. The value must be a number between 0 and 3. The default is 3. #### `-I <path>` Appends the given path to the list of directories to search for imported Mojo files. #### `-D <key=value>` Defines a named value that can be used from within the Mojo source file being executed. For example, `-D foo=42` defines a name `foo` that, when queried with the `sys.param_env` module from within the Mojo program, would yield the compile-time value `42`. #### `--debug-level <level>`, `-g (LEVEL=full)` Sets the level of debug info to use at compilation. The value must be one of: `none` (the default value), `line-tables`, or `full`. Please note that there are issues when generating debug info for some Mojo programs that have yet to be addressed. #### `--num-threads <num>`, `-j` Sets the maximum number of threads to use for compilation. The default is 0 (use all available threads). ### Target options #### `--target-triple <triple>` Sets the compilation target triple. Defaults to the host target. #### `--target-cpu <cpu>` Sets the compilation target CPU.
Defaults to the host CPU. #### `--target-features <features>` Sets the compilation target CPU features. Defaults to the host features. #### `--march <arch>` Sets the architecture for which to generate code. #### `--mcpu <cpu>` Sets the CPU for which to generate code. #### `--mtune <cpu>` Sets the CPU for which to tune code. ### Compilation diagnostic options Controls how the Mojo compiler outputs diagnostics related to compiling and running Mojo source code. #### `--diagnose-missing-doc-strings` Emits diagnostics for missing or partial doc strings. #### `--validate-doc-strings` Emits errors for invalid doc strings instead of warnings. #### `--max-notes-per-diagnostic <max-notes>` When the Mojo compiler emits diagnostics, it sometimes also prints notes with additional information. This option sets an upper threshold on the number of notes that can be printed with a diagnostic. If not specified, the default maximum is 10. #### `--disable-builtins` Do not use builtins when creating a package. #### `--disable-warnings` Do not print warning messages. ### Debugger options #### `--X <arg>` Passes ARG as an argument to the debugger when the debug session is launched using the debugger command-line interface. This option can be specified multiple times. It is ignored when using the RPC mode. ### Debug server options #### `--vscode` Launches the debug session on VS Code via the Mojo extension. #### `--rpc` Alias for --vscode. #### `--terminal <type>` The type of terminal to use when starting a launch debug session. * `console` (default): the debuggee will be launched in the default environment for the editor. If using VS Code, this will be the Debug Console. * `dedicated`: the debuggee will be launched in a dedicated terminal within the editor. #### `--port <port>` Uses the given PORT to communicate with the RPC debug server. Defaults to trying all ports from 12355 to 12364 inclusive. #### `--stop-on-entry` Automatically stop after launch. #### `--init-command <command>` Initialization command executed upon debugger startup. Can be specified multiple times. ### Experimental compilation options #### `--sanitize <check>` Turns on runtime checks. The following values are supported: `address` (detects memory issues), and `thread` (detects multi-threading issues). #### `--shared-libasan` Dynamically link the address sanitizer runtime. Requires address sanitization turned on with the `--sanitize` option. #### `--debug-info-language <language>` Sets the language to emit as part of the debug info. The supported languages are: `Mojo`, and `C`. `C` is the default, and is useful to enable rudimentary debugging and binary introspection in tools that don't understand Mojo. ### Common options #### `--help`, `-h` Displays help information. --- ## Mojo decorators A Mojo decorator is a [higher-order function](https://en.wikipedia.org/wiki/Higher-order_function) that modifies or extends the behavior of a struct, a function, or some other code. Instead of actually calling the higher-order function, you simply add the decorator (such as the `@value` decorator) above your code (such as a struct). The Mojo compiler then uses the decorator function to modify your code at compile time. :::note No custom decorators The creation of custom decorators is not yet supported. The available ones are built directly into the compiler. ::: The following pages describe each built-in decorator with examples. :::🔥#docs ::: --- ## mojo demangle Demangles the given name. ## Synopsis ``` mojo demangle [options] <name> ``` ## Description If the given name is a mangled Mojo symbol name, prints the demangled name.
If no name is provided, one is read from standard input. ## Options ### Common options #### `--help`, `-h` Displays help information. --- ## mojo doc Compiles docstrings from a Mojo file. ## Synopsis ``` mojo doc [options] <path> ``` ## Description This is an early version of a documentation tool that generates an API reference from Mojo code comments. Currently, it generates a structured output of all docstrings into a JSON file, and it does not generate HTML. This output format is subject to change. The input may be a single file or a directory. If you specify a directory, it will generate a single JSON output with documentation for all modules found in that path, recursively. ## Options ### Output options #### `-o <path>` Sets the path and filename for the JSON output. If not provided, output is written to stdout. ### Compilation options #### `-I <path>` Appends the given path to the list of directories that Mojo will search for any package/module dependencies. That is, if the file you pass to `mojo doc` imports any packages that do not reside in the local path and are not part of the Mojo standard library, use this to specify the path where Mojo can find those packages. ### Validation options The following validation options help ensure that your docstrings use valid structure and meet other style criteria. By default, warnings are emitted only if the docstrings contain errors that prevent translation to the output format. (More options coming later.) #### `--diagnose-missing-doc-strings` Emits diagnostics for missing or partial doc strings. #### `--validate-doc-strings` Emits errors for invalid doc strings instead of warnings. ### Compilation diagnostic options Controls how the Mojo compiler outputs diagnostics related to compiling and running Mojo source code. #### `--max-notes-per-diagnostic <max-notes>` When the Mojo compiler emits diagnostics, it sometimes also prints notes with additional information. This option sets an upper threshold on the number of notes that can be printed with a diagnostic. If not specified, the default maximum is 10. ### Common options #### `--diagnostic-format <format>` The format in which diagnostics and error messages are printed. Must be one of "text" or "json" ("text" is the default). #### `--help`, `-h` Displays help information. --- ## Mojo documentation code examples This directory includes code examples used in the Mojo Manual and related documentation at [docs.modular.com/mojo](/mojo). Reference solutions for Mojo tutorials can be found in the [`/examples/mojo`](../../examples/mojo) directory. The primary purpose of this directory is to enable automated testing of code examples. **Note:** Code examples in the API reference documentation for the Mojo Standard Library and other Modular open source libraries are embedded in the source files for those libraries and are not included here. ## Contributing If you see something in the documentation or the code examples that is incorrect or could be improved, we'd love to accept your contributions. At this time, code from this directory is **not** automatically included in the corresponding documentation file. If you contribute a change to a code example, please be sure to make a corresponding change to the copy of the code in the related documentation, as well as any explanatory text. Be aware that we don't provide tools to generate a preview of the website, because the Mojo docs are built along with other content that's not included in this repo.
As such, we recommend you preview your edits in an IDE that can render Markdown and MDX files, such as VS Code, including the [VS Code environment in GitHub](https://github.dev/modular/modular/blob/main/). For more information about how to contribute, see the [Contributor Guide](../CONTRIBUTING.md). --- ## mojo format Formats Mojo source files. ## Synopsis ``` mojo format [options] <sources> ``` ## Description Formats the given set of Mojo sources using a Mojo-specific lint tool. ## Options ### Format options #### `--line-length <length>`, `-l <length>` Sets the max character line length. Default is 80. ### Diagnostic options #### `--quiet`, `-q` Disables non-error messages. ### Common options #### `--help`, `-h` Displays help information. --- ## Mojo language basics This page provides an overview of the Mojo language. If you know Python, then a lot of Mojo code will look familiar. However, Mojo incorporates features like static type checking, memory safety, next-generation compiler technologies, and more. As such, Mojo also has a lot in common with languages like C++ and Rust. If you prefer to learn by doing, follow the [Get started with Mojo](/mojo/manual/get-started) tutorial. You'll install the [Magic](/magic) CLI, create a Mojo project, and write your first Mojo program. On this page, we'll introduce the essential Mojo syntax, so you can start coding quickly and understand other Mojo code you encounter. Subsequent sections in the Mojo Manual dive deeper into these topics, and links are provided below as appropriate. Let's get started! 🔥 :::note Mojo is a young language and there are many [features still missing](/mojo/roadmap). As such, Mojo is currently **not** meant for beginners. Even this basics section assumes some programming experience. However, throughout the Mojo Manual, we try not to assume experience with any particular language. ::: ## Hello world Here's the traditional "Hello world" program in Mojo: ```mojo def main(): print("Hello, world!") ``` Every Mojo program must include a function named `main()` as the entry point. We'll talk more about functions soon, but for now it's enough to know that you can write `def main():` followed by an indented function body. The `print()` statement does what you'd expect, printing its arguments to the standard output. ## Variables In Mojo, you can declare a variable by simply assigning a value to a new named variable: ```mojo def main(): x = 10 y = x * x print(y) ``` You can also _explicitly_ declare variables with the `var` keyword: ```mojo var x = 10 ``` When declaring a variable with `var`, you can also declare a variable type, with or without an assignment: ```mojo def main(): var x: Int = 10 var sum: Int sum = x + x ``` Both implicitly declared and explicitly declared variables are statically typed: that is, the type is set at compile time, and doesn't change at runtime. If you don't specify a type, Mojo uses the type of the first value assigned to the variable. ```mojo x = 10 x = "Foo" # Error: Cannot convert "StringLiteral" value to "Int" ``` For more details, see the page about [variables](/mojo/manual/variables). ## Blocks and statements Code blocks such as functions, conditions, and loops are defined with a colon followed by indented lines. For example: ```mojo def loop(): for x in range(5): if x % 2 == 0: print(x) ``` You can use any number of spaces or tabs for your indentation (we prefer 4 spaces). All code statements in Mojo end with a newline. However, statements can span multiple lines if you indent the following lines.
For example, this long string spans two lines: ```mojo def print_line(): long_text = "This is a long line of text that is a lot easier to read if" " it is broken up across two lines instead of one long line." print(long_text) ``` And you can chain function calls across lines: ```mojo def print_hello(): text = String(",") .join("Hello", " world!") print(text) ``` For more information on loops and conditional statements, see [Control flow](/mojo/manual/control-flow). ## Functions You can define a Mojo function using either the `def` or `fn` keyword. For example, the following uses the `def` keyword to define a function named `greet` that requires a single `String` argument and returns a `String`: ```mojo def greet(name: String) -> String: return "Hello, " + name + "!" ``` Where `def` and `fn` differ is error handling and argument mutability defaults. Refer to the [Functions](/mojo/manual/functions) page for more details on defining and calling functions. ## Code comments You can create a one-line comment using the hash `#` symbol: ```mojo # This is a comment. The Mojo compiler ignores this line. ``` Comments may also follow some code: ```mojo var message = "Hello, World!" # This is also a valid comment ``` API documentation comments are enclosed in triple quotes. For example: ```mojo fn print(x: String): """Prints a string. Args: x: The string to print. """ ... ``` Documenting your code with these kinds of comments (known as "docstrings") is a topic we've yet to fully specify, but you can generate an API reference from docstrings using the [`mojo doc` command](/mojo/cli/doc). :::note Technically, docstrings aren't _comments_, they're a special use of Mojo's syntax for multi-line string literals. For details, see [String literals](/mojo/manual/types#string-literals) in the page on [Types](/mojo/manual/types). ::: ## Structs You can build high-level abstractions for types (or "objects") as a `struct`. A `struct` in Mojo is similar to a `class` in Python: they both support methods, fields, operator overloading, decorators for metaprogramming, and so on. However, Mojo structs are completely static—they are bound at compile-time, so they do not allow dynamic dispatch or any runtime changes to the structure. (Mojo will also support Python-style classes in the future.) For example, here's a basic struct: ```mojo struct MyPair: var first: Int var second: Int fn __init__(out self, first: Int, second: Int): self.first = first self.second = second fn __copyinit__(out self, existing: Self): self.first = existing.first self.second = existing.second def dump(self): print(self.first, self.second) ``` And here's how you can use it: ```mojo def use_mypair(): var mine = MyPair(2, 4) mine.dump() ``` Note that some functions are declared with the `fn` keyword, while the `dump()` function is declared with `def`. In general, you can use either form in a struct. The `MyPair` struct contains two special methods, `__init__()`, the constructor, and `__copyinit__()`, the copy constructor. _Lifecycle methods_ like this control how a struct is created, copied, moved, and destroyed. For most simple types, you don't need to write the lifecycle methods. You can use the `@value` decorator to generate the boilerplate code for you. So the `MyPair` struct can be simplified to this: ```mojo @value struct MyPair: var first: Int var second: Int def dump(self): print(self.first, self.second) ``` For more details, see the page about [structs](/mojo/manual/structs). ### Traits A trait is like a template of characteristics for a struct.
If you want to create a struct with the characteristics defined in a trait, you must implement each characteristic (such as each method). Each characteristic in a trait is a "requirement" for the struct, and when your struct implements all of the requirements, it's said to "conform" to the trait. Using traits allows you to write generic functions that can accept any type that conforms to a trait, rather than accepting only specific types. For example, here's how you can create a trait: ```mojo trait SomeTrait: fn required_method(self, x: Int): ... ``` The three dots following the method signature are Mojo syntax indicating that the method is not implemented. Here's a struct that conforms to `SomeTrait`: ```mojo @value struct SomeStruct(SomeTrait): fn required_method(self, x: Int): print("hello traits", x) ``` Then, here's a function that uses the trait as an argument type (instead of the struct type): ```mojo fn fun_with_traits[T: SomeTrait](x: T): x.required_method(42) fn use_trait_function(): var thing = SomeStruct() fun_with_traits(thing) ``` You'll see traits used in a lot of APIs provided by Mojo's standard library. For example, Mojo's collection types like `List` and `Dict` can store any type that conforms to the `Copyable` and `Movable` traits. You can specify the type when you create a collection: ```mojo my_list = List[Float64]() ``` :::note You're probably wondering about the square brackets on `fun_with_traits()`. These aren't function *arguments* (which go in parentheses); these are function *parameters*, which we'll explain next. ::: Without traits, the `x` argument in `fun_with_traits()` would have to declare a specific type that implements `required_method()`, such as `SomeStruct` (but then the function would accept only that type). With traits, the function can accept any type for `x` as long as it conforms to (it "implements") `SomeTrait`. Thus, `fun_with_traits()` is known as a "generic function" because it accepts a *generalized* type instead of a specific type. For more details, see the page about [traits](/mojo/manual/traits). ## Parameterization In Mojo, a parameter is a compile-time variable that becomes a runtime constant, and it's declared in square brackets on a function or struct. Parameters allow for compile-time metaprogramming, which means you can generate or modify code at compile time. Many other languages use "parameter" and "argument" interchangeably, so be aware that when we say things like "parameter" and "parametric function," we're talking about these compile-time parameters. A function "argument," by contrast, is a runtime value that's declared in parentheses. Parameterization is a complex topic that's covered in much more detail in the [Metaprogramming](/mojo/manual/parameters/) section, but we want to break the ice just a little bit here. To get you started, let's look at a parametric function: ```mojo def repeat[count: Int](msg: String): @parameter # evaluate the following for loop at compile time for i in range(count): print(msg) ``` This function has one parameter of type `Int` and one argument of type `String`. To call the function, you need to specify both the parameter and the argument: ```mojo def call_repeat(): repeat[3]("Hello") # Prints "Hello" 3 times ``` By specifying `count` as a parameter, the Mojo compiler is able to optimize the function because this value is guaranteed to not change at runtime. And the `@parameter` decorator in the code tells the compiler to evaluate the `for` loop at compile time, not runtime.
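To make that concrete, here's a rough sketch of what `repeat[3]` is conceptually equivalent to (a hypothetical illustration, not actual compiler output):

```mojo
# Hypothetical: what the compiler conceptually produces for `repeat[3]`,
# with the parametric loop unrolled into straight-line code.
def repeat_3(msg: String):
    print(msg)
    print(msg)
    print(msg)

def main():
    repeat_3("Hello")  # Same output as repeat[3]("Hello")
```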
The compiler effectively generates a unique version of the `repeat()` function that repeats the message only 3 times. This makes the code more performant because there's less to compute at runtime. Similarly, you can define a struct with parameters, which effectively allows you to define variants of that type at compile-time, depending on the parameter values. For more detail on parameters, see the section on [Metaprogramming](/mojo/manual/parameters/). ## Python integration Mojo lets you import Python modules as-is, so you can leverage existing Python code right away. For example, here's how you can import and use NumPy: ```mojo from python import Python def main(): var np = Python.import_module("numpy") var ar = np.arange(15).reshape(3, 5) print(ar) print(ar.shape) ``` You must have the Python module (such as `numpy`) installed in the environment where you're using Mojo. You can install Python packages into your virtual environment using [Magic](/magic/) or [Conda](/magic/conda). For more details, see the page on [Python integration](/mojo/manual/python/). ## Next steps Hopefully this page has given you enough information to start experimenting with Mojo, but this is only touching the surface of what's available. If you're in the mood to read more, continue through each page of this Mojo Manual—the next page from here is [Functions](/mojo/manual/functions). Otherwise, here are some other resources to check out: * See [Get started with Mojo](/mojo/manual/get-started) for a hands-on tutorial that will get you up and running with Mojo. * If you want to experiment with some code, clone [our GitHub repo](https://github.com/modular/modular/) to try our code examples: ```sh git clone https://github.com/modular/modular.git ``` ```sh cd max/examples/mojo ``` * To see all the available Mojo APIs, check out the [Mojo standard library reference](/mojo/lib). --- ## Mojo Manual Welcome to the Mojo Manual, a complete guide to the Mojo🔥 programming language! Mojo is designed to solve a variety of AI development challenges that no other language can, because Mojo is the first programming language built from the ground up with [MLIR](https://mlir.llvm.org/) (a compiler infrastructure that's ideal for heterogeneous hardware, from CPUs and GPUs to various AI ASICs). We also designed Mojo as the best way to extend Python because we love Python and its community, but we couldn't realistically enhance Python to do all the things we wanted. For a longer discussion on this topic, read [Why Mojo](/mojo/why-mojo). Beware that Mojo is still a very young language, so there's a lot that hasn't been built yet. Likewise, there's a lot of documentation that hasn't been written yet. But we're excited to share Mojo with you and [get your feedback](https://www.modular.com/community).
## Contents * **Get started** * [Why Mojo](/mojo/why-mojo) * [Get started with Mojo](/mojo/manual/get-started) * **Language basics** * [Overview](/mojo/manual/basics) * [Functions](/mojo/manual/functions) * [Variables](/mojo/manual/variables) * [Types](/mojo/manual/types) * [Operators and expressions](/mojo/manual/operators) * [Control flow](/mojo/manual/control-flow) * [Errors and context managers](/mojo/manual/errors) * [Structs](/mojo/manual/structs) * [Modules and packages](/mojo/manual/packages) * **Value ownership** * [Intro to value ownership](/mojo/manual/values/) * [Value semantics](/mojo/manual/values/value-semantics) * [Ownership](/mojo/manual/values/ownership) * [Lifetimes, origins, and references](/mojo/manual/values/lifetimes) * **Value lifecycle** * [Intro to value lifecycle](/mojo/manual/lifecycle/) * [Life of a value](/mojo/manual/lifecycle/life) * [Death of a value](/mojo/manual/lifecycle/death) * **Traits and parameters** * [Traits](/mojo/manual/traits) * [Parameterization: compile-time metaprogramming](/mojo/manual/parameters/) * **Pointers** * [Intro to pointers](/mojo/manual/pointers/) * [Unsafe pointers](/mojo/manual/pointers/unsafe-pointers) * **GPU programming** * [Get started with GPU programming](/mojo/manual/gpu/intro-tutorial) * [GPU basics](/mojo/manual/gpu/basics) * **Layouts and LayoutTensor** * [Introduction to Layouts](/mojo/manual/layout/layouts) * **Python** * [Python integration](/mojo/manual/python/) * [Mojo calling Python](/mojo/manual/python/mojo-calling-python) * [Python calling Mojo](/mojo/manual/python/python-calling-mojo) * [Python types](/mojo/manual/python/types) * **Tools** * [Debugging](/mojo/tools/debugging) * [GPU debugging](/mojo/tools/gpu-debugging) * [Testing](/mojo/tools/testing) * **Project information** * [Roadmap and sharp edges](/mojo/roadmap) * [Changelog](/mojo/changelog) * [FAQ](/mojo/faq) --- ## mojo package Compiles a Mojo package. ## Synopsis ``` mojo package [options] <path> ``` ## Description Compiles a directory of Mojo source files into a binary package suitable to share and import into other Mojo programs and modules. A Mojo package is portable across different systems because it includes only non-elaborated code (it's not an arch-specific package). The code becomes an arch-specific executable only after it's imported into a Mojo program that's then compiled with `mojo build`. To create a Mojo package, first add an `__init__.mojo` file to your package directory. Then pass that directory name to this command, and specify the output path and filename with `-o`. For more information, see [Mojo modules and packages](/mojo/manual/packages). ## Options ### Output options #### `-o <path>` Sets the path and filename for the output package. The filename must end with either `.mojopkg` or `.📦`. The filename given here defines the package name you can then use to import the code (minus the file extension). If you don't specify this option, a `.mojopkg` file is generated in the current working directory, with a name based on the name of the input directory. ### Compilation options #### `-I <path>` Appends the given path to the list of directories to search for imported Mojo files. #### `-kgenModule` Export as a KGEN module. ### Compilation diagnostic options Controls how the Mojo compiler outputs diagnostics related to compiling and running Mojo source code. #### `--diagnose-missing-doc-strings` Emits diagnostics for missing or partial doc strings. #### `--validate-doc-strings` Emits errors for invalid doc strings instead of warnings.
#### `--max-notes-per-diagnostic <max-notes>` When the Mojo compiler emits diagnostics, it sometimes also prints notes with additional information. This option sets an upper threshold on the number of notes that can be printed with a diagnostic. If not specified, the default maximum is 10. #### `--disable-builtins` Do not use builtins when creating a package. #### `--disable-warnings` Do not print warning messages. ### Common options #### `--diagnostic-format <format>` The format in which diagnostics and error messages are printed. Must be one of "text" or "json" ("text" is the default). #### `--help`, `-h` Displays help information. --- ## Mojo reference This section includes the Mojo API references: - [Standard library](#standard-library): Common Mojo APIs. - [MAX AI kernels library](#max-ai-kernels-library): Mojo APIs for writing high-performance computational kernels and custom operations for AI models. - [MAX library](#max-library): MAX Mojo APIs, including tensor APIs for custom operations, and legacy MAX APIs. - [Decorators](#decorators): Mojo decorators reference. ## How to read the Mojo API docs Mojo syntax is covered in detail in the [Mojo manual](/mojo/manual/). Here's a quick cheat-sheet on reading struct and function signatures. ### Arguments Function arguments appear in parentheses after the function name: ```mojo fn example_fn(pos: Int, /, pos_or_kw: Int, *, kw_only: Bool = False): ... ``` Here's a quick overview of some special syntax in the argument list: - Slash (`/`): arguments declared before a slash are [positional-only arguments](/mojo/manual/functions#positional-only-and-keyword-only-arguments). - Star (`*`): a star by itself in place of an argument indicates that the arguments after the star are [keyword-only](/mojo/manual/functions#positional-only-and-keyword-only-arguments). - An equals sign (`=`) introduces a default value for an [optional argument](/mojo/manual/functions#optional-arguments). You may also see argument names prefixed with one or two stars (`*`): ```mojo def myfunc2(*names, **attributes): ``` - An argument name prefixed by a single star character, like `*names`, identifies a [variadic argument](/mojo/manual/functions/#variadic-arguments). - An argument name prefixed with a double star, like `**attributes`, identifies a [variadic keyword-only argument](/mojo/manual/functions/#variadic-keyword-arguments). An argument may also be preceded by an _argument convention_, which indicates how the value is passed: ```mojo fn sort(mut names: List[String]): ``` The most common conventions are: - `read` (default): the callee receives an **immutable reference** to the value. - `mut`: the callee receives a **mutable reference** to the value. - `owned`: the callee receives ownership of a value. For details and a complete list of argument conventions, see [Argument conventions](/mojo/manual/values/ownership#argument-conventions). ### Parameters Mojo structs and functions can take parameters. Parameters are evaluated at compilation time, and act as constants at runtime. Parameter lists are enclosed in square brackets: ```mojo struct ExampleStruct[size: Int, //, thing: Thing[size]]: ``` Parameters that occur before a double-slash (`//`) in the parameter list are [infer-only parameters](/mojo/manual/parameters/#infer-only-parameters). You usually don't need to specify infer-only parameters; as the name suggests, they're usually inferred. Like arguments, parameters can be positional-only, keyword-or-positional, or keyword-only, and they can be required or optional.
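For example, here's a small sketch of an infer-only parameter in action (the function and names are hypothetical, not from the standard library); `dtype` comes before the `//`, so it's inferred from the argument type and the caller specifies only `factor`:

```mojo
fn scale[dtype: DType, //, factor: Int](x: SIMD[dtype, 1]) -> SIMD[dtype, 1]:
    # `dtype` is infer-only: it's deduced from the type of `x`,
    # so callers never write it explicitly.
    return x * factor

fn main():
    print(scale[2](Float64(3.5)))  # dtype inferred as float64; prints 7.0
```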
### Parameters

Mojo structs and functions can take parameters. Parameters are evaluated at compilation time, and act as constants at runtime. Parameter lists are enclosed in square brackets:

```mojo
struct ExampleStruct[size: Int, //, thing: Thing[size]]:
```

Parameters that occur before a double-slash (`//`) in the parameter list are [infer-only parameters](/mojo/manual/parameters/#infer-only-parameters). You usually don't need to specify infer-only parameters; as the name suggests, they're usually inferred.

Like arguments, parameters can be positional-only, keyword-or-positional, or keyword-only, and they can be required or optional. The `/`, `*`, and `=` characters have the same meaning in parameter lists as they do in argument lists.
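For instance, here's a minimal sketch of an infer-only parameter in action (the `scale` function is a hypothetical example, not a standard library API):

```mojo
fn scale[dtype: DType, //, factor: Int](x: SIMD[dtype, 1]) -> SIMD[dtype, 1]:
    # `dtype` is infer-only: it's deduced from the argument `x`,
    # so callers bind only `factor` explicitly.
    return x * SIMD[dtype, 1](factor)

fn main():
    # `dtype` is inferred as DType.float64 from the Float64 argument.
    print(scale[2](Float64(3.0)))  # 6.0
```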
## Standard library

The Mojo standard library provides nearly everything you'll need for writing Mojo programs, including basic data types like [`Int`](/mojo/stdlib/builtin/int/Int) and [`SIMD`](/mojo/stdlib/builtin/simd/SIMD), collection types like [`List`](/mojo/stdlib/collections/list/List), reusable [algorithms](/mojo/stdlib/algorithm/), and modules to support [GPU programming](/mojo/stdlib/gpu).

Top-level packages:

:::🔥#stdlib
:::

## MAX AI kernels library

The MAX AI kernels library provides a collection of highly optimized, reusable compute kernels for high-performance numerical and AI workloads. These kernels serve as the foundational building blocks for writing [MAX custom operations](/max/custom-ops/) or standalone [GPU kernels](/mojo/manual/gpu/basics) that are portable across CPUs and GPUs.

Top-level packages:

:::🔥#kernels
:::

## MAX library

The Mojo MAX library provides APIs to interact with the MAX graph compiler and runtime.

Top-level packages:

:::🔥#maxlib
:::

## Decorators

A Mojo decorator is a higher-order function that modifies or extends the behavior of a struct, a function, or some other code.

:::🔥#decorators
:::

---

## mojo repl

Launches the Mojo REPL.

## Synopsis

```
mojo repl [lldb-options]
```

## Description

Launches a Mojo read-evaluate-print loop (REPL) environment, which provides interactive development in the terminal. You can also start the REPL by simply running `mojo`.

Any number of options and arguments may be specified on the command line. These are then forwarded to the underlying lldb tool, which runs the REPL.

## Options

### Common options

#### `--help`, `-h`

Displays help information.

---

## mojo run

Builds and executes a Mojo file.

## Synopsis

```
mojo run [options] <PATH> [path-arguments...]
```

## Description

Compiles the Mojo file at the given path and immediately executes it. Another way to execute this command is to simply pass a file to `mojo`. For example:

```
mojo hello.mojo
```

Options for this command itself, such as the ones listed below, must appear before the input file `path` argument. Any command line arguments that appear after the Mojo source file `path` are interpreted as arguments for that Mojo program.

## Options

### Compilation options

#### `--optimization-level <LEVEL>`, `-O`, `--no-optimization (LEVEL=0)`

Sets the level of optimization to use at compilation. The value must be a number between 0 and 3. The default is 3.

#### `-I <PATH>`

Appends the given path to the list of directories to search for imported Mojo files.

#### `-D <KEY=VALUE>`

Defines a named value that can be used from within the Mojo source file being executed. For example, `-D foo=42` defines a name `foo` that, when queried with the `sys.param_env` module from within the Mojo program, would yield the compile-time value `42` (see the example after this command's options).

#### `--debug-level <LEVEL>`, `-g (LEVEL=full)`

Sets the level of debug info to use at compilation. The value must be one of: `none` (the default value), `line-tables`, or `full`. Please note that there are issues when generating debug info for some Mojo programs that have yet to be addressed.

#### `--num-threads <N>`, `-j`

Sets the maximum number of threads to use for compilation. The default is 0 (use all available threads).

### Target options

#### `--target-triple <TRIPLE>`

Sets the compilation target triple. Defaults to the host target.

#### `--target-cpu <CPU>`

Sets the compilation target CPU. Defaults to the host CPU.

#### `--target-features <FEATURES>`

Sets the compilation target CPU features. Defaults to the host features.

#### `--march <ARCH>`

Sets the architecture for which to generate code.

#### `--mcpu <CPU>`

Sets the CPU for which to generate code.

#### `--mtune <CPU>`

Sets the CPU for which to tune code.

### Compilation diagnostic options

Controls how the Mojo compiler outputs diagnostics related to compiling and running Mojo source code.

#### `--diagnose-missing-doc-strings`

Emits diagnostics for missing or partial doc strings.

#### `--validate-doc-strings`

Emits errors for invalid doc strings instead of warnings.

#### `--max-notes-per-diagnostic <N>`

When the Mojo compiler emits diagnostics, it sometimes also prints notes with additional information. This option sets an upper threshold on the number of notes that can be printed with a diagnostic. If not specified, the default maximum is 10.

#### `--disable-builtins`

Do not use builtins when creating the package.

#### `--disable-warnings`

Do not print warning messages.

### Experimental compilation options

#### `--sanitize <CHECK>`

Turns on runtime checks. The following values are supported: `address` (detects memory issues), and `thread` (detects multi-threading issues).

#### `--shared-libasan`

Dynamically link the address sanitizer runtime. Requires address sanitization to be turned on with the `--sanitize` option.

#### `--debug-info-language <LANGUAGE>`

Sets the language to emit as part of the debug info. The supported languages are: `Mojo`, and `C`. `C` is the default, and is useful to enable rudimentary debugging and binary introspection in tools that don't understand Mojo.

### Common options

#### `--diagnostic-format <FORMAT>`

The format in which diagnostics and error messages are printed. Must be one of "text" or "json" ("text" is the default).

#### `--help`, `-h`

Displays help information.
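As a sketch of how `-D` values surface in code, here's a program that reads a value supplied with `mojo run -D foo=42 main.mojo` (the name `foo` and the fallback value are illustrative):

```mojo
from sys.param_env import env_get_int

fn main():
    # Reads the compile-time value supplied with `-D foo=42`,
    # falling back to 0 if `foo` was not defined.
    alias foo = env_get_int["foo", 0]()
    print(foo)
```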
---

## mojo test

Execute unit, integration, and documentation tests.

## Synopsis

```
mojo test [options] <PATH>
```

## Description

Execute the given Mojo tests.

## Options

### Collection options

#### `--collect-only`, `--co`

Only collect tests, don't execute them.

#### `--filter <REGEX>`

A POSIX extended regular expression used to filter test IDs.

### Test run options

#### `--debug`

Launch a debugger session with `mojo debug`. This is not supported for docstring tests. Most debug flags are supported, including `--vscode`.

### Compilation options

#### `--optimization-level <LEVEL>`, `-O`, `--no-optimization (LEVEL=0)`

Sets the level of optimization to use at compilation. The value must be a number between 0 and 3. The default is 3.

#### `-I <PATH>`

Appends the given path to the list of directories to search for imported Mojo files.

#### `-D <KEY=VALUE>`

Defines a named value that can be used from within the Mojo source file being executed. For example, `-D foo=42` defines a name `foo` that, when queried with the `sys.param_env` module from within the Mojo program, would yield the compile-time value `42`.

#### `--debug-level <LEVEL>`, `-g (LEVEL=full)`

Sets the level of debug info to use at compilation. The value must be one of: `none` (the default value), `line-tables`, or `full`. Please note that there are issues when generating debug info for some Mojo programs that have yet to be addressed.

#### `--num-threads <N>`, `-j`

Sets the maximum number of threads to use for compilation. The default is 0 (use all available threads).

### Target options

#### `--target-triple <TRIPLE>`

Sets the compilation target triple. Defaults to the host target.

#### `--target-cpu <CPU>`

Sets the compilation target CPU. Defaults to the host CPU.

#### `--target-features <FEATURES>`

Sets the compilation target CPU features. Defaults to the host features.

#### `--march <ARCH>`

Sets the architecture for which to generate code.

#### `--mcpu <CPU>`

Sets the CPU for which to generate code.

#### `--mtune <CPU>`

Sets the CPU for which to tune code.

### Compilation diagnostic options

Controls how the Mojo compiler outputs diagnostics related to compiling and running Mojo source code.

#### `--diagnose-missing-doc-strings`

Emits diagnostics for missing or partial doc strings.

#### `--validate-doc-strings`

Emits errors for invalid doc strings instead of warnings.

#### `--max-notes-per-diagnostic <N>`

When the Mojo compiler emits diagnostics, it sometimes also prints notes with additional information. This option sets an upper threshold on the number of notes that can be printed with a diagnostic. If not specified, the default maximum is 10.

#### `--disable-builtins`

Do not use builtins when creating the package.

#### `--disable-warnings`

Do not print warning messages.

### Debugger options

#### `--X <ARG>`

Passes `ARG` as an argument to the debugger when the debug session is launched using the debugger command-line interface. This option can be specified multiple times. It is ignored when using RPC mode.

### Debug server options

#### `--vscode`

Launches the debug session on VS Code via the Mojo extension.

#### `--rpc`

Alias for `--vscode`.

#### `--terminal <TYPE>`

The type of terminal to use when starting a launch debug session.

* `console` (default): the debuggee will be launched in the default environment for the editor. If using VS Code, this will be the Debug Console.
* `dedicated`: the debuggee will be launched in a dedicated terminal within the editor.

#### `--port <PORT>`

Uses the given PORT to communicate with the RPC debug server. Defaults to trying all ports from 12355 to 12364 inclusive.

#### `--stop-on-entry`

Automatically stop after launch.

#### `--init-command <COMMAND>`

Initialization command executed upon debugger startup. Can be specified multiple times.

### Experimental compilation options

#### `--sanitize <CHECK>`

Turns on runtime checks. The following values are supported: `address` (detects memory issues), and `thread` (detects multi-threading issues).

#### `--shared-libasan`

Dynamically link the address sanitizer runtime. Requires address sanitization to be turned on with the `--sanitize` option.

#### `--debug-info-language <LANGUAGE>`

Sets the language to emit as part of the debug info. The supported languages are: `Mojo`, and `C`. `C` is the default, and is useful to enable rudimentary debugging and binary introspection in tools that don't understand Mojo.

### Common options

#### `--diagnostic-format <FORMAT>`

The format in which diagnostics and error messages are printed. Must be one of "text" or "json" ("text" is the default).

#### `--help`, `-h`

Displays help information.

---

## Mojo🔥 changelog

This is a list of changes to the Mojo language, standard library, and tools.

To check your current version, run `mojo --version`. To update the version of Mojo for your project with the `magic` package manager, follow the instructions in [Update a package](/magic#update-a-package) to update the `max` package.

## v25.4 nightly

This version is still a work in progress. See how to [install the nightly release](/max/packages#nightly-release).

### ✨ Highlights

* The Python-Mojo bindings are available as a preview release! You can now call Mojo functions from existing Python codebases. The main use case is to speed up hot spots in slow Python code by rewriting performance-critical portions in Mojo.
* Parts of the kernel library continue to be progressively open sourced! Packages that are open sourced now include:
  * `kv_cache`
  * `quantization`
  * `nvml`
  * Benchmarks
  * The `Mogg` directory, which contains registration of kernels with the Graph Compiler
* Implicit trait conformance is deprecated. Each instance of implicit conformance results in a warning, but compilation still goes through. Soon it will become an error. Any code currently relying on implicit conformance should either declare conformances explicitly or, if appropriate, replace empty, non-load-bearing traits with trait compositions.

### Language changes

* The [`Dict`](/mojo/stdlib/collections/dict/Dict/) type is now part of the prelude, so there is no need to import it anymore.
* The Mojo compiler will now synthesize `__moveinit__`, `__copyinit__`, and `copy()` methods for structs that conform to `Movable`, `Copyable`, and `ExplicitlyCopyable` (respectively) but that do not implement the methods explicitly.
* A new `@fieldwise_init` decorator can be attached to structs to synthesize a fieldwise initializer: an `__init__` method that takes the same arguments as the fields in the struct. This gives access to this helpful capability without having to opt into the rest of the methods that `@value` synthesizes. This decorator allows an optional `@fieldwise_init("implicit")` form for single-element structs, which marks the initializer as `@implicit`. (See the sketch after this list.)
* `try` and `raise` now work at compile time.
* "Initializer lists" are now supported for creating struct instances with an inferred type based on context, for example:

```mojo
fn foo(x: SomeComplicatedType): ...

# Example with normal initializer.
foo(SomeComplicatedType(1, kwarg=42))
# Example with initializer list.
foo({1, kwarg=42})
```

* List literals have been redesigned to work better. They produce homogenous sequences by invoking the `T(<elements>, __list_literal__: ())` constructor of a type `T` that is inferred by context, or otherwise defaulting to the standard library `List[Elt]` type. The `ListLiteral` type has been removed from the standard library.
* Dictionary and set literals now work and default to creating instances of the `Dict` and `Set` types in the collections library.
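Here's a minimal sketch combining two of the changes above, the `@fieldwise_init` decorator and dictionary literals (the `Point` struct is a hypothetical example, and assumes the 25.4 toolchain):

```mojo
@fieldwise_init
struct Point(Copyable, Movable):
    var x: Int
    var y: Int

def main():
    # Synthesized fieldwise initializer from @fieldwise_init.
    var p = Point(x=1, y=2)
    # Dictionary literals now default to the stdlib Dict type.
    var scores: Dict[String, Int] = {"a": 1, "b": 2}
    print(p.x, scores["a"])
```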
### Standard library changes

* The `CollectionElement` trait has been removed.
* Added support for a wider range of consumer-grade hardware, including:
  * NVIDIA RTX 2060 GPUs
  * NVIDIA RTX 4090 GPUs
* The `bitset` data structure was added to the `collections` package. This is a fixed-size bitset that simplifies working with a set of bits and performing bit operations.
* Fixed GPU `sum` and `prefix_sum` implementations in the `gpu.warp` and `gpu.block` modules. Previously, the implementations were incorrect and would either return wrong results or hang the kernel due to a deadlock. [PR 4508](https://github.com/modular/modular/pull/4508) and [PR 4553](https://github.com/modular/modular/pull/4553) by [Kirill Bobyrev](https://github.com/kirillbobyrev) mitigate the found issues and add tests to ensure correctness going forward.

Changes to Python-Mojo interoperability:

* Python objects are now constructible with list/set/dict literal syntax, e.g.: `var list: PythonObject = [1, "foo", 2.0]` will produce a Python list containing other Python objects, and `var d: PythonObject = {}` will construct an empty dictionary.
* `Python.{unsafe_get_python_exception, throw_python_exception_if_error_state}` have been removed in favor of `CPython.{unsafe_get_error, get_error}`.
* Since virtually any operation on a `PythonObject` can raise, the `PythonObject` struct no longer implements the `Indexer` and `Intable` traits. Instead, it now conforms to `IntableRaising`, and users should convert explicitly to builtin types and handle exceptions as needed. In particular, the `PythonObject.__int__` method now returns a Python `int` instead of a Mojo `Int`, so users must explicitly convert to a Mojo `Int` if they need one (and must handle the exception if the conversion fails, e.g. due to overflow).
* `PythonObject` no longer implements the following traits:
  * `Stringable`. Instead, the `PythonObject.__str__` method now returns a Python `str` object and can raise. The new `Python.str` function can also be used to convert an arbitrary `PythonObject` to a Python `str` object.
  * `KeyElement`. Since Python objects may not be hashable, and even if they are, could theoretically raise in the `__hash__` method, `PythonObject` cannot conform to `Hashable`. This has no effect on accessing Python `dict` objects with `PythonObject` keys, since `__getitem__` and `__setitem__` should behave correctly and raise as needed. Two overloads of the `Python.dict` factory function have been added to allow constructing dictionaries from a list of key-value tuples and from keyword arguments.
  * `EqualityComparable`. The `PythonObject.{__eq__, __ne__}` methods need to return other `PythonObject` values to support rich comparisons. Code that previously compared `PythonObject` values should be wrapped in `Bool(..)` to perform the fallible conversion explicitly: `if Bool(obj1 == obj2): ...`.
  * `Floatable`. An explicit, raising constructor is added to `SIMD` to allow constructing `Float64` values from `PythonObject` values that implement `__float__`.
* `String` and `Bool` now implement `ConvertibleFromPython`.
* A new `def_function` API is added to `PythonModuleBuilder` to allow declaring Python bindings for arbitrary functions that take and return `PythonObject`s. Similarly, a new `def_method` API is added to `PythonTypeBuilder` to allow declaring Python bindings for methods that take and return `PythonObject`s.
* The `ConvertibleFromPython` trait is now public. This trait is implemented by Mojo types that can be constructed by converting from a `PythonObject`. This is the reverse operation of the `PythonConvertible` trait.
* `PythonObject(alloc=...)` is a new constructor that can be used to directly store Mojo values in Python objects. This initializer will fail if the type of the provided Mojo value has not previously had a corresponding Python 'type' object globally registered using `PythonModuleBuilder.add_type[T]()`.
* `PythonObject` has new methods for downcasting to a pointer to a contained Mojo value, for use in Python/Mojo interop:

```mojo
struct Person:
    var name: String

fn greet(obj: PythonObject) raises:
    var person = obj.downcast_value_ptr[Person]()
    print("Hello ", person[].name, "from Mojo🔥!")
```

  * `PythonObject.downcast_value_ptr[T]()` checks if the object is a wrapped instance of the Mojo type `T`, and if so, returns an `UnsafePointer[T]`. Otherwise, an exception is raised.
  * `PythonObject.unchecked_downcast_value_ptr[T]()` unconditionally returns an `UnsafePointer[T]` without any runtime type checking. This is useful when using Python/Mojo interop to optimize an inner loop and minimizing overhead is desirable. Also added an equivalent `UnsafePointer` initializer for downcasting from a `PythonObject`.
* The `Python.is_type(x, y)` static method has been removed. Use the expression `x is y` instead.
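As a short sketch of the explicit conversions these changes require (using the `Python.list()` factory and the `Bool(..)` / `Int(..)` conversions named in the bullets above; the specific values are illustrative):

```mojo
from python import Python, PythonObject

def main():
    var items = Python.list(1, 2, 3)
    # Comparisons now return PythonObject; convert explicitly for control flow.
    if Bool(items[0] == 1):
        # __int__ now returns a Python int; construct a Mojo Int explicitly
        # (this conversion can raise, e.g. on overflow).
        var first = Int(items[0])
        print(first)
```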
* `os.abort(messages)` no longer supports a generic variadic number of `Writable` messages. While this API was high-level and convenient, it generates a lot of IR for simple and common cases, such as when we have a single `StringLiteral` message. Instead of generating bloated IR, callers must now create the `String` on their side before calling `os.abort(message)`.
* The function `atof` has been entirely rewritten, as it produced incorrect results for very low and very high exponents. It now works correctly for strings with fewer than 19 digits left of the `e`. For example, `1.1385616158185648648648648648616186186e-3` won't work, and will raise an error. Anything that does not produce an error is now guaranteed to be correct. While the current implementation is not the fastest, it's based on the paper [Number Parsing at a Gigabyte per Second](https://arxiv.org/abs/2101.11408) by Daniel Lemire. So with a bit of effort to pinpoint the slow parts, we can easily have state-of-the-art performance in the future.

### Tooling changes

* Added support for emitting LLVM Intermediate Representation (.ll) using `--emit=llvm`.
  * Example usage: `mojo build --emit=llvm YourModule.mojo`
* Removed support for the command line option `--emit-llvm` in favor of `--emit=llvm`.
* Added support for emitting assembly code (.s) using `--emit=asm`.
  * Example usage: `mojo build --emit=asm YourModule.mojo`
* Added `associated alias` support for documentation generated via `mojo doc`.

### 🛠️ Fixed

* [#4352](https://github.com/modular/modular/issues/4352) - `math.sqrt` produces incorrect results for large inputs.
* [#4518](https://github.com/modular/modular/issues/4518) - Try Except Causes False Positive "Uninitialized Value".
* [#4677](https://github.com/modular/modular/issues/4677), [#4688](https://github.com/modular/modular/issues/4668) - Incorrect result for unsigned `gt` and `le` comparisons.

## v25.3 (2025-05-06)

### ✨ Highlights

* Parts of the Mojo standard library continue to be progressively open sourced! Packages that are open sourced now include:
  * `algorithm`
  * `benchmark`
  * `buffer`
  * `compile`
  * `complex`
  * `gpu`
  * `logger`
  * `runtime`
  * `subprocess`

  For more information, see the [Standard library reference](/mojo/lib#standard-library) and the [Standard library source](https://github.com/modular/modular/tree/main/mojo/stdlib).
* Parts of the MAX AI kernels library continue to be progressively open sourced! Packages that are open sourced now include:
  * `layout`
  * `linalg`
  * `register`

  For more information, see the [MAX AI kernels library reference](/mojo/lib#max-ai-kernels-library) and the [MAX AI kernels source](https://github.com/modular/modular/tree/main/max/kernels).
* Trait compositions are now supported via the `&` syntax. A trait composition combines two traits into one logical trait whose constraint set is the union of the constraint sets of the two original traits. For more information, see [Trait compositions](/mojo/manual/traits/#trait-compositions) in the Mojo Manual.
* String types in Mojo got several significant improvements. See [Standard library changes](#25-3-standard-library-changes) for details.

### Language changes {#25-3-language-changes}

* Mojo can now use [user-declared `__merge_with__()` dunder methods](https://github.com/modular/modular/blob/main/mojo/proposals/custom-type-merging.md) to merge values when using different types in ternary operations.
This has been adopted to allow pointers to work naturally with the ternary operator, for example `var x = one_pointer if cond else other_pointer`.

* Auto-parameterization now extends to struct metatypes. For example, the declaration `fn foo[M: __type_of(StringLiteral[_])]` will auto-parameterize on the unbound parameter of `StringLiteral`.
* The Mojo compiler now warns about stores to values that are never used, e.g.: `x = foo(); x = bar()` will warn about the first assignment to `x` because it is overwritten. You can generally address this by deleting dead code, or by assigning to `_` instead: `_ = foo(); x = bar()`. You may also encounter this in variable declarations, e.g. `var x = 0; ...; x = foo()`. In this case, change the variable to being declared as uninitialized, e.g. `var x: Int`. You may also silence this warning entirely for a variable by renaming it to start with an underscore, e.g. `_x`.
* The Mojo compiler now warns about obsolete use of `mut self` in initializers; please switch over to `fn __init__(out self)` instead.
* `def` functions now require type annotations on arguments, and treat a missing return type as returning `None`. Previously these defaulted to the `object` type, which led to a variety of problems. Support for `object` has been removed until we have time to investigate a proper replacement.

### Standard library changes {#25-3-standard-library-changes}

String types in Mojo got several significant improvements:

* The [`String`](/mojo/stdlib/collections/string/string/String/) type no longer copies data from [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral/) and [`StaticString`](/mojo/stdlib/collections/string/string_slice/#aliases) since they are known-static-constant values. This allows us to make construction from these values implicit, which improves ergonomics and performance together. It also implements the "small string optimization", which avoids heap allocation for common short strings. On a 64-bit system, `String` can hold up to 23 bytes inline. Its copy constructor is now O(1), performing the string data copy lazily on mutation.
* The types [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice/) and [`StaticString`](/mojo/stdlib/collections/string/string_slice/#aliases) are now part of the prelude, so there is no need to import them anymore. These are useful for code that just needs a "view" of string data, not to own and mutate it.
* The [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral/) type has been moved to a more reliable "dependent type" design where the value of the string is carried in a parameter instead of a stored member. This defines away a category of compiler crashes when working with `StringLiteral` that involved attempting to manipulate a `StringLiteral` at run time. As a consequence of this change, many APIs should switch to using [`StaticString`](/mojo/stdlib/collections/string/string_slice/#aliases) instead of `StringLiteral`. For more information on this "dependent type" design for literals, see the proposal, [Fixing Simple Literals in Mojo](https://github.com/modular/modular/blob/main/mojo/proposals/fixing-simple-literals.md).
* `String` supports a new `String(unsafe_uninit_length=x)` constructor and `str.resize(unsafe_uninit_length=x)` for clients that want to allocate space that they intend to fill in with custom unsafe initialization patterns. The `String(ptr=x, length=y)` constructor has been removed.
* `String` supports working with legacy C APIs that assume null termination, but the details have changed: `String` is now no longer implicitly null-terminated, which means that it is incorrect to assume that `str.unsafe_ptr()` will return a null-terminated string. For that, use the `str.unsafe_cstr_ptr()` method. It now requires the string to be mutable in order to make null-termination lazy on demand. This improves performance for strings that are not passed to legacy APIs.
* The [`List`](/mojo/stdlib/collections/list/List) type has been improved similarly to `String` to reduce inconsistency and enable power-user features, including adding a `List(unsafe_uninit_length=x)` constructor and a `list.resize(unsafe_uninit_size=n)` method that avoid initializing memory that the caller plans to overwrite.
* [`Set`](/mojo/stdlib/collections/set/Set/) now conforms to the [`Copyable`](/mojo/stdlib/builtin/value/Copyable/) trait, so you can store sets in other types of collections (for example, as values in a `Dict`).
* The following traits have been removed in favor of trait composition: `EqualityComparableCollectionElement`, `RepresentableCollectionElement`, `TestableCollectionElement`, `Testable`, `StringableIdentifiable`, `StringableCollectionElement`, `IntervalPayload`, `WritableCollectionElement`, `ComparableCollectionElement`, `BoolableCollectionElement`, `EqualityComparableWritableCollectionElement`, `EqualityComparableWritableCollectionElementNew`, `CollectionElementNew`, `WritableCollectionElementNew`. For example, you can replace `EqualityComparableCollectionElement` with `EqualityComparable & CollectionElement`. `StringableCollectionElement` was already deprecated and scheduled to be removed; it can be replaced with `Writable & CollectionElement`.
* The [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) type is being reworked in preparation for some improvements to Mojo-Python interoperability:
  * Since virtually any operation on a `PythonObject` can raise, the `PythonObject` struct no longer implements the following traits: `ImplicitlyBoolable`, `ImplicitlyIntable`.
  * `PythonObject` is no longer implicitly constructible from tuple or list literals. For example, `var x: PythonObject = [1, 2, "foo"]` is no longer accepted. Instead, please use the new `Python.list()` and `Python.tuple()` factory methods. For example:

```mojo
var x = Python.list(1, 2, "foo")
```

  (The `list()` and `tuple()` factory methods were originally added on `PythonObject`, but have been moved to the `Python` struct.) We hope to re-enable literal syntax in the future as the standard library matures.
  * `PythonObject.from_borrowed_ptr()` has been removed in favor of a constructor with a keyword-only `from_borrowed_ptr` argument.
  * The deprecated `PythonObject.to_float64()` method has been removed. Use the `Float64()` constructor instead.
* [`Span`](/mojo/stdlib/memory/span/Span) now has a `swap_elements()` method, which takes two indices and swaps them within the span.
* [`Pointer`](/mojo/stdlib/memory/pointer/Pointer/) now has a `get_immutable()` method to return a new `Pointer` with the same underlying data but with an `ImmutableOrigin`.
* You can now forward a [`VariadicPack`](/mojo/stdlib/builtin/list_literal/VariadicPack/) where all values are `Writable` to a writer using [`WritableVariadicPack`](/mojo/stdlib/utils/write/WritableVariadicPack/):

```mojo
from utils.write import WritableVariadicPack

fn print_message[*Ts: Writable](*messages: *Ts):
    print("message:", WritableVariadicPack(messages), "[end]")

x = 42
print_message("'x = ", x, "'")
```

```text
message: 'x = 42' [end]
```

In this example the variadic pack is buffered to the stack in the `print` call along with the extra arguments, before doing a single syscall to write to stdout.

* [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert/) in AMD GPU kernels now behaves the same as on NVIDIA, printing the thread information and the variadic args passed after the condition:

```mojo
from gpu.host import DeviceContext

fn kernel():
    var x = 1
    debug_assert(x == 2, "x should be 2 but is: ", x)

def main():
    with DeviceContext() as ctx:
        ctx.enqueue_function[kernel](grid_dim=2, block_dim=2)
```

Running `mojo run -D ASSERT=all [filename]` will output:

```text
At /tmp/test.mojo:5:17: block: [0,0,0] thread: [0,0,0] Assert Error: x should be 2 but is: 1
At /tmp/test.mojo:5:17: block: [0,0,0] thread: [1,0,0] Assert Error: x should be 2 but is: 1
At /tmp/test.mojo:5:17: block: [1,0,0] thread: [0,0,0] Assert Error: x should be 2 but is: 1
At /tmp/test.mojo:5:17: block: [1,0,0] thread: [1,0,0] Assert Error: x should be 2 but is: 1
```

* The [`constrained[cond, string]()`](/mojo/stdlib/builtin/constrained/constrained/) function now accepts multiple strings that are printed concatenated on failure, so you can use:

```mojo
constrained[cond, "hello: ", String(n), ": world"]()
```

This is more compile-time efficient and somewhat more ergonomic than using string concatenation.

* [`pathlib.Path.write_text()`](/mojo/stdlib/pathlib/path/Path/#write_text) now accepts a `Writable` argument instead of a `Stringable` argument. This makes the function more efficient by removing a String allocation.
* Added [`pathlib.Path.write_bytes()`](/mojo/stdlib/pathlib/path/Path/#write_bytes), which enables writing raw bytes to a file.
* Added [`os.path.split_extension()`](/mojo/stdlib/os/path/path/split_extension) to split a path into its root and extension.
* Added [`os.path.is_absolute()`](/mojo/stdlib/os/path/path/is_absolute) to check if a given path is absolute or not.
* You can now specify the consistency model used in atomic operations, with the default being sequential consistency. The consistency models are defined in the [`Consistency`](/mojo/stdlib/os/atomic/Consistency/) struct.
* Added the [`Variant.is_type_supported()`](/mojo/stdlib/utils/variant/Variant/#is_type_supported) method. ([PR #4057](https://github.com/modular/modular/pull/4057)) Example:

```mojo
def takes_variant(mut arg: Variant):
    if arg.is_type_supported[Float64]():
        arg = Float64(1.5)

def main():
    var x = Variant[Int, Float64](1)
    takes_variant(x)
    if x.isa[Float64]():
        print(x[Float64])  # 1.5
```

* The `type` parameter of `SIMD` has been renamed to `dtype`.
* The `is_power_of_two(x)` function in the `bit` package is now a method on `Int`, `UInt`, and `SIMD`.
* The `Pointer.address_of(...)` and `UnsafePointer.address_of(...)` functions have been deprecated. Please use the [`Pointer(to=...)`](/mojo/stdlib/memory/pointer/Pointer#__init__) and [`UnsafePointer(to=...)`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#__init__) constructors instead. Conceptually, this is saying "please initialize a `Pointer` (a reference, if you will) to *some other address in memory*." In the future, these `address_of()` functions will be removed.
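A minimal sketch of the new spelling (the variable names are illustrative):

```mojo
from memory import Pointer

fn main():
    var value = 42
    # Before: var p = Pointer.address_of(value)
    var p = Pointer(to=value)
    print(p[])  # 42
```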
### Tooling changes {#25-3-tooling-changes}

* Fixed SIMD boolean display in the debugger: SIMD boolean values now display correctly with proper bit extraction.
* Improved language server performance: the language server now avoids parsing more than it needs to, improving performance across the board.
* The Mojo compiler is now able to interpret all arithmetic operations from the `index` dialect that are used in methods of the `Int` and `UInt` types. That allows users to finally compute constants at compile time:

```mojo
alias a: Int = 1000000000
alias b: Int = (5 * a) // 2
```

Previously, the compiler would throw the error "cannot fold operation".

* Added a new `--emit-llvm` option to the `mojo build` command, which allows users to emit LLVM IR. When `--emit-llvm` is specified, the build process will: compile mojo code to LLVM IR, save the IR to a .ll file (using the same name as the input file), and print the IR to stdout for immediate inspection.

### Other changes

* The syntax for adding attributes to an `__mlir_op` is now limited to inherent attributes (those defined by the op definition). Most users will not need to attach other kinds of attributes, and this helps guard against typos and Mojo code getting outdated when the dialect changes.

### ❌ Removed {#25-3-removed}

* The `SIMD.roundeven()` method has been removed from the standard library. This functionality is now handled by the [`round()`](/mojo/stdlib/builtin/math/round) function.
* Error messages about the obsolete `borrowed` and `inout` keywords, as well as the obsolete `-> Int as name` syntax, have been removed.
* The `object` type has been removed.
* `utils.numerics.ulp` has been removed. Use the [`ulp()`](/mojo/stdlib/math/math/ulp) function from the `math` package instead.
* Several free functions that were deprecated in the 25.2 release have now been removed. This includes:
  * The `str` free function. Use the `String` constructor instead.
  * The `int` free function. Use the `Int` constructor instead.
  * The `bool` free function. Use the `Bool` constructor instead.
  * The `float` free function. Use the `Float64` constructor instead.
* Removed the deprecated [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext/) methods `copy_sync()` and `memset_sync()`.
* The `unroll()` utility has been removed. Use the [`@parameter for` construct](/mojo/manual/decorators/parameter#parametric-for-statement) instead.

```mojo
from utils.loop import unroll

# Before
@always_inline
@parameter
fn foo[i: Int]():
    body_logic[i]()
unroll[foo, iteration_range]()

# After
@parameter
for i in range(iteration_range):
    body_logic[i]()
```

* The `InlinedString` type has been removed. Use `String` instead, which now supports the Small String Optimization (SSO).

### 🛠️ Fixed {#25-3-fixed}

* [#3510](https://github.com/modular/modular/issues/3510) - `PythonObject` doesn't handle large `UInt64` correctly.
* [#3847](https://github.com/modular/modular/issues/3847) - Count leading zeros can't be used on `SIMD` at compile time.
* [#4198](https://github.com/modular/modular/issues/4198) - Apple M4 is not properly detected with `sys.is_apple_silicon()`.
* [#3662](https://github.com/modular/modular/issues/3662) - Code using `llvm.assume` cannot run at compile time.
* [#4273](https://github.com/modular/modular/issues/4273) - `count_leading_zeros` doesn't work for vectors with size > 1 at compile time.
* [#4320](https://github.com/modular/modular/issues/4320) - Intermittent miscompilation with bytecode imported traits.
* [#4281](https://github.com/modular/modular/issues/4281) - MAX does not support RTX 5000-series GPUs.
* [#4163](https://github.com/modular/modular/issues/4163) - Corner case in initializers.
* [#4360](https://github.com/modular/modular/issues/4360) - Fix constructor emission for parameterized types conforming to a trait composition.
* [#4362](https://github.com/modular/modular/issues/4362) - Function call with `IntLiteral` incorrectly eliminated despite side-effects.
* [#4431](https://github.com/modular/modular/issues/4431) - \[BUG] Python.evaluate doesn't handle null termination correctly.
* [#4492](https://github.com/modular/modular/issues/4488) - Fix `StringSlice.replace` seg fault.

### Special thanks

Special thanks to our community contributors: [@auris](https://github.com/auris), [@bgreni](https://github.com/bgreni), [@christianbator](https://github.com/christianbator), [@KamilGucik](https://github.com/KamilGucik), [@kasmith11](https://github.com/kasmith11), [@martinvuyk](https://github.com/martinvuyk), [@ratulb](https://github.com/ratulb), [@rd4com](https://github.com/rd4com), [@sora](https://github.com/sora), [@thatstoasty](https://github.com/thatstoasty), and [@winding-lines](https://github.com/winding-lines).

## v25.2 (2025-03-25)

### ✨ Highlights

* Check out the new [GPU basics](/mojo/manual/gpu/basics) section of the [Mojo Manual](/mojo/manual) and the [Get started with GPU programming with Mojo and the MAX Driver](/mojo/manual/gpu/intro-tutorial) tutorial for a guide to getting started with GPU programming in Mojo!
* Some APIs in the [`gpu`](/mojo/stdlib/gpu/) package were enhanced to simplify working with GPUs:
  * If you're executing a GPU kernel only once, you can now skip compiling it first before enqueueing it, and pass it directly to [`DeviceContext.enqueue_function()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_function).
  * The three separate methods on `DeviceContext` for asynchronously copying buffers between host and GPU memory have been combined into a single overloaded [`enqueue_copy()`](/mojo/stdlib/gpu/host/device_context/DeviceContext/#enqueue_copy) method, and the three separate methods for synchronous copies have been combined into an overloaded [`copy_sync()`](/mojo/stdlib/gpu/host/device_context/DeviceContext/#copy_sync) method.
  * The `gpu.shuffle` module has been renamed to [`gpu.warp`](/mojo/stdlib/gpu/warp/) to better reflect its purpose.
* The [`gpu`](/mojo/stdlib/gpu) package API documentation has been expanded, and API documentation for the [`layout`](/mojo/kernels/layout) package is underway, beginning with core types, functions, and traits. See the [Standard library changes](#25-2-standard-library-changes) section of the changelog for more information.
* The legacy `borrowed`/`inout` keywords and `-> T as foo` syntax are no longer supported and now generate a compiler error. Please move to `read`/`mut`/`out` argument syntax instead. See [Argument conventions](/mojo/manual/values/ownership#argument-conventions) in the Mojo Manual for more information.
* The standard library has many changes related to strings.
Notably, the `Char` type has been renamed to [`Codepoint`](/mojo/stdlib/collections/string/codepoint/Codepoint), to better capture its intended purpose of storing a single Unicode codepoint. Additionally, related method and type names have been updated as well. See [Standard library changes](#25-2-standard-library-changes) for more details.

* Support has been added for 128- and 256-bit signed and unsigned integers. This includes the [`DType`](/mojo/stdlib/builtin/dtype/DType) aliases `DType.int128`, `DType.uint128`, `DType.int256`, and `DType.uint256`, as well as [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) support for 128- and 256-bit signed and unsigned element types. Note that this exposes capabilities (and limitations) of LLVM, which may not always provide high performance for these types and may have missing operations like divide, remainder, etc. See [Standard library changes](#25-2-standard-library-changes) for more details.

### Language changes {#25-2-language-changes}

* References to aliases in struct types with unbound (or partially bound) parameter sets are now allowed, as long as the referenced alias doesn't depend on any unbound parameters:

```mojo
struct StructWithParam[a: Int, b: Int]:
    alias a1 = 42
    alias a2 = a+1

fn test():
    _ = StructWithParam.a1     # ok
    _ = StructWithParam[1].a2  # ok
    _ = StructWithParam.a2     # error, 'a' is unbound.
```

* The Mojo compiler now warns about `@parameter for` with a large loop unrolling factor (>1024 by default), which can lead to long compilation time and large generated code size. Set `--loop-unrolling-warn-threshold` to change the default value to a different threshold, or to `0` to disable the warning.
* The Mojo compile-time interpreter can now handle many more LLVM intrinsics, including ones that return floating point values. This allows functions like [`round()`](/mojo/stdlib/builtin/math/round) to be constant folded when used in a compile-time context.
* The Mojo compiler now has only one compile-time interpreter. It had two previously: one to handle a few cases that were important for dependent types in the parser (but which also had many limitations), and the primary one that ran at "instantiation" time, which is fully general. This was confusing and caused a wide range of bugs. We've now removed the special case parse-time interpreter, replacing it with a more general solution for dependent types. This change should be invisible to most users, but should resolve a number of long-standing bugs and significantly simplifies the compiler implementation, allowing us to move faster.

### Standard library changes {#25-2-standard-library-changes}

* [`Optional`](/mojo/stdlib/collections/optional/Optional), [`Span`](/mojo/stdlib/memory/span/Span), and [`InlineArray`](/mojo/stdlib/collections/inline_array/InlineArray) have been added to the prelude. You now no longer need to explicitly import these types to use them in your program.
* GPU programming changes:
  * You can now skip compiling a GPU kernel first before enqueueing it, and pass it directly to [`DeviceContext.enqueue_function()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#enqueue_function):

```mojo
from gpu.host import DeviceContext

fn func():
    print("Hello from GPU")

with DeviceContext() as ctx:
    ctx.enqueue_function[func](grid_dim=1, block_dim=1)
```

However, if you're reusing the same function and parameters multiple times, this incurs some overhead of around 50-500 nanoseconds per enqueue.
So you can still compile the function first with [`DeviceContext.compile_function()`](/mojo/stdlib/gpu/host/device_context/DeviceContext#compile_function) and pass it to `DeviceContext.enqueue_function()` like this:

```mojo
with DeviceContext() as ctx:
    var compiled_func = ctx.compile_function[func]()
    # Multiple kernel launches with the same function/parameters
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
```

  * The following methods on [`DeviceContext`](/mojo/stdlib/gpu/host/device_context/DeviceContext):
    * `enqueue_copy_to_device()`
    * `enqueue_copy_from_device()`
    * `enqueue_copy_device_to_device()`

    have been combined into a single overloaded [`enqueue_copy()`](/mojo/stdlib/gpu/host/device_context/DeviceContext/#enqueue_copy) method. Additionally, the methods:
    * `copy_to_device_sync()`
    * `copy_from_device_sync()`
    * `copy_device_to_device_sync()`

    have been combined into an overloaded [`copy_sync()`](/mojo/stdlib/gpu/host/device_context/DeviceContext/#copy_sync) method.
  * The `gpu.shuffle` module has been renamed to [`gpu.warp`](/mojo/stdlib/gpu/warp/) to better reflect its purpose. For example:

```mojo
import gpu.warp as warp

var val0 = warp.shuffle_down(x, offset)
var val1 = warp.broadcast(x)
```

* Support has been added for 128- and 256-bit signed and unsigned integers:
  * The following aliases have been added to the [`DType`](/mojo/stdlib/builtin/dtype/DType) struct: `DType.int128`, `DType.uint128`, `DType.int256`, and `DType.uint256`.
  * The [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type now supports 128- and 256-bit signed and unsigned element types. Note that this exposes capabilities (and limitations) of LLVM, which may not always provide high performance for these types and may have missing operations like divide, remainder, etc.
  * The following [`Scalar`](/mojo/stdlib/builtin/simd/#aliases) aliases for 1-element `SIMD` values have been added: `Int128`, `UInt128`, `Int256`, and `UInt256`.
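For instance, a minimal sketch using the new aliases (keeping in mind the LLVM limitations noted above, e.g. missing divide or remainder for these widths):

```mojo
fn main():
    var a = Int128(1) << 100             # 128-bit scalar arithmetic
    var v = SIMD[DType.uint256, 2](42)   # a vector of 256-bit unsigned elements
    print(a)
    print(v)
```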
* [`String`](/mojo/stdlib/collections/string) and friends:
  * The `Char` type has been renamed to [`Codepoint`](/mojo/stdlib/collections/string/codepoint/Codepoint), to better capture its intended purpose of storing a single Unicode codepoint. Additionally, related method and type names have been updated as well, including:
    * `StringSlice.chars()` and `String.chars()` to [`StringSlice.codepoints()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#codepoints) and [`String.codepoints()`](/mojo/stdlib/collections/string/string/String/#codepoints), respectively
    * `StringSlice.char_slices()` and `String.char_slices()` to [`StringSlice.codepoint_slices()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#codepoint_slices) and [`String.codepoint_slices()`](/mojo/stdlib/collections/string/string/String/#codepoint_slices), respectively
    * `CharsIter` to [`CodepointsIter`](/mojo/stdlib/collections/string/string_slice/CodepointsIter)
    * `Char.unsafe_decode_utf8_char()` to [`Codepoint.unsafe_decode_utf8_codepoint()`](/mojo/stdlib/collections/string/codepoint/Codepoint/#unsafe_decode_utf8_codepoint)
  * Made the iterator type returned by the string `codepoint_slices()` methods public as [`CodepointSliceIter`](/mojo/stdlib/collections/string/string_slice/CodepointSliceIter/).
  * [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice) now supports several additional methods moved from [`String`](/mojo/stdlib/collections/string/string/String). The existing `String` methods have been updated to instead call the corresponding new `StringSlice` methods:
    * [`center()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#center)
    * [`is_ascii_digit()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#is_ascii_digit)
    * [`is_ascii_printable()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#is_ascii_printable)
    * [`islower()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#islower)
    * [`isupper()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#isupper)
    * [`ljust()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#ljust)
    * [`lower()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#lower)
    * [`rjust()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#rjust)
    * [`split()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#split)
    * [`upper()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#upper)
  * Added a [`StringSlice.is_codepoint_boundary()`](/mojo/stdlib/collections/string/string_slice/StringSlice/#is_codepoint_boundary) method for querying if a given byte index is a boundary between encoded UTF-8 codepoints.
  * [`StringSlice.__getitem__(Slice)`](/mojo/stdlib/collections/string/string_slice/StringSlice/#__getitem__) now raises an error if the provided slice start and end positions do not fall on a valid codepoint boundary. This prevents construction of malformed `StringSlice` values, which could lead to memory unsafety or undefined behavior. For example, given a string containing multi-byte encoded data, like:

```mojo
str_slice = "Hi👋!"
```

  whose in-memory and decoded data looks like:

```text
String:                Hi👋!
Codepoint characters:  H    i    👋               !
Codepoints:            72   105  128075           33
Bytes:                 72   105  240 159 145 139  33
Index:                 0    1    2   3   4   5    6
```

  attempting to slice bytes `[3-5)` with `str_slice[3:5]` would previously erroneously produce a malformed `StringSlice` as output that did not correctly decode to anything:

```text
String:                invalid
Codepoint characters:  invalid
Codepoints:            invalid
Bytes:                 159 145
Index:                 0   1
```

  The same statement will now raise an error informing the user that their indices are invalid.
  * The `StringLiteral.get[value]()` method, which converts a compile-time value of [`Stringable`](/mojo/stdlib/builtin/str/Stringable) type, has been changed to a function named [`get_string_literal[value]()`](/mojo/stdlib/builtin/string_literal/get_string_literal).
* Collections:
  * A new [`IntervalTree`](/mojo/stdlib/collections/interval/IntervalTree) data structure has been added to the standard library. This is a tree data structure that allows for efficient range queries.
  * Added an iterator to [`LinkedList`](/mojo/stdlib/collections/linked_list/LinkedList) ([PR #4005](https://github.com/modular/modular/pull/4005)):
    * [`LinkedList.__iter__()`](/mojo/stdlib/collections/linked_list/LinkedList/#__iter__) to create a forward iterator.
    * [`LinkedList.__reversed__()`](/mojo/stdlib/collections/linked_list/LinkedList/#__reversed__) for a backward iterator.

```mojo
var ll = LinkedList[Int](1, 2, 3)
for element in ll:
    print(element[])
```

  * `List.bytecount()` has been renamed to [`List.byte_length()`](/mojo/stdlib/collections/list/List/#byte_length) for consistency with the string-like APIs.
  * The [`InlineArray(unsafe_uninitialized=True)`](/mojo/stdlib/collections/inline_array/InlineArray/#__init__) constructor is now spelled `InlineArray(uninitialized=True)`.
* The design of the [`IntLiteral`](/mojo/stdlib/builtin/int_literal/IntLiteral) and [`FloatLiteral`](/mojo/stdlib/builtin/float_literal/FloatLiteral) types has been changed to maintain their compile-time-only value as a parameter instead of a stored field. This correctly models that infinite precision literals are not representable at runtime, and eliminates a number of bugs hit in corner cases. This is made possible by enhanced dependent type support in the compiler.
* The `Buffer` struct has been removed in favor of [`Span`](/mojo/stdlib/memory/span/Span) and [`NDBuffer`](/mojo/stdlib/buffer/buffer/NDBuffer).
* The [`round()`](/mojo/stdlib/builtin/math/round) function is now fixed to perform "round half to even" (also known as "bankers' rounding") instead of "round half away from zero".
* The [`UnsafePointer.alloc()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer/#alloc) method has changed to produce pointers with an empty `Origin` parameter, instead of with `MutableAnyOrigin`. This mitigates an issue with the any origin parameter extending the lifetime of unrelated local variables for this common method.
* Several more packages are now documented:
  * [`compile`](/mojo/stdlib/compile) package
  * [`gpu`](/mojo/stdlib/gpu) package
  * [`layout`](/mojo/kernels/layout) package is underway, beginning with core types, functions, and traits
* Added a new [`sys.is_compile_time()`](/mojo/stdlib/sys/compile/is_compile_time) function. This enables you to query whether code is being executed at compile time or not. For example:

```mojo
from sys import is_compile_time

fn check_compile_time() -> String:
    if is_compile_time():
        return "compile time"
    else:
        return "runtime"

def main():
    alias var0 = check_compile_time()
    var var1 = check_compile_time()
    print("var0 is evaluated at ", var0, " , while var1 is evaluated at ", var1)
```

will print `var0 is evaluated at compile time, while var1 is evaluated at runtime`.

### Tooling changes {#25-2-tooling-changes}

* Mojo API documentation generation is now able to display function and struct parameter references inside nested parametric types using names instead of indices. For example, instead of

```mojo
sort[type: CollectionElement, //, cmp_fn: fn($1|0, $1|0) capturing -> Bool](span: Span[type, origin])
```

it now displays

```mojo
sort[type: CollectionElement, //, cmp_fn: fn(type, type) capturing -> Bool](span: Span[type, origin])
```

### ❌ Removed

* Use of legacy argument conventions like `inout` and the use of `as` in named results now produces an error message instead of a warning.
* Direct access to `List.size` has been removed. Use the public API instead. Examples:

Extending a List:

```mojo
base_data = List[Byte](1, 2, 3)

data_list = List[Byte](4, 5, 6)
ext_data_list = base_data.copy()
ext_data_list.extend(data_list)  # [1, 2, 3, 4, 5, 6]

data_span = Span(List[Byte](4, 5, 6))
ext_data_span = base_data.copy()
ext_data_span.extend(data_span)  # [1, 2, 3, 4, 5, 6]

data_vec = SIMD[DType.uint8, 4](4, 5, 6, 7)
ext_data_vec_full = base_data.copy()
ext_data_vec_full.extend(data_vec)  # [1, 2, 3, 4, 5, 6, 7]

ext_data_vec_partial = base_data.copy()
ext_data_vec_partial.extend(data_vec, count=3)  # [1, 2, 3, 4, 5, 6]
```

Slicing and extending a list efficiently:

```mojo
base_data = List[Byte](1, 2, 3, 4, 5, 6)
n4_n5 = Span(base_data)[3:5]
extra_data = Span(List[Byte](8, 10))
end_result = List[Byte](capacity=len(n4_n5) + len(extra_data))
end_result.extend(n4_n5)
end_result.extend(extra_data)  # [4, 5, 8, 10]
```

* `InlinedFixedVector` and `InlineList` have been removed.
Instead, use [`InlineArray`](/mojo/stdlib/collections/inline_array/InlineArray) when the upper bound is known at compile time. If the upper bound is not known until runtime, use [`List`](/mojo/stdlib/collections/list/List) with the `capacity` constructor to minimize allocations.

### 🛠️ Fixed

* [#3976](https://github.com/modular/modular/issues/3976) - The `variance` argument in [`random.randn_float64()`](/mojo/stdlib/random/random/randn_float64) and [`random.randn()`](/mojo/stdlib/random/random/randn) has been renamed to `standard_deviation`, so that values are drawn from the correct distribution.

### Special thanks

Special thanks to our community contributors: [@bgreni](https://github.com/bgreni), [@fnands](https://github.com/fnands), [@illiasheshyn](https://github.com/illiasheshyn), [@izo0x90](https://github.com/izo0x90), [@lydiandy](https://github.com/lydiandy), [@martinvuyk](https://github.com/martinvuyk), [@msaelices](https://github.com/msaelices), [@owenhilyard](https://github.com/owenhilyard), [@rd4com](https://github.com/rd4com), [@yinonburgansky](https://github.com/yinonburgansky)

## v25.1 (2025-02-13)

### ✨ Highlights

* The legacy `borrowed`/`inout` keywords and `-> T as foo` syntax are deprecated and now generate a compiler warning. Please move to `read`/`mut`/`out` argument syntax instead. See [Argument conventions](/mojo/manual/values/ownership#argument-conventions) in the Mojo Manual for more information.
* The `bool()`, `float()`, `int()`, and `str()` functions are deprecated and generate compiler warnings. Please use the `Bool()`, `Float64()`, `Int()`, and `String()` constructors instead. See [Standard library changes](#25-1-standard-library-changes) for more details.
* The standard library has many changes related to strings. The new [`Char`](/mojo/stdlib/collections/string/codepoint/Codepoint) struct represents a single Unicode character, and includes several methods for categorizing character types. When iterating over the characters of a `String` with a `for` loop, you should now use the [`String.chars()`](/mojo/stdlib/collections/string/string/String#chars) method to provide an iterator of `Char` values, or the [`String.char_slices()`](/mojo/stdlib/collections/string/string/String#char_slices) method to provide an iterator of [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice/) instances for each character (see the sketch after this list). `StringRef` has been removed in favor of [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice/). And various functionality has moved from `String` and `StringLiteral` to the more general `StringSlice` type. See [Standard library changes](#25-1-standard-library-changes) for more details.
* You can now use [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) constructors to cast existing `SIMD` values (including `Scalar` values) to a different type, though you can still use the [`SIMD.cast()`](/mojo/stdlib/builtin/simd/SIMD#cast) method to infer the size of the new vector. See [Standard library changes](#25-1-standard-library-changes) for more details.
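A sketch of the new iteration style referenced above (assuming the 25.1-era `chars()` and `char_slices()` APIs; the string value is illustrative):

```mojo
def main():
    var s = String("Hi👋!")
    # Iterate over Char values (one per Unicode codepoint):
    var count = 0
    for _ in s.chars():
        count += 1
    print(count, "codepoints")  # 4 codepoints
    # Iterate per-character StringSlice values:
    for piece in s.char_slices():
        print(piece)
```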
### Language changes {#25-1-language-changes}

* The legacy `borrowed`/`inout` keywords and `-> T as foo` syntax now generate a warning. Please move to `read`/`mut`/`out` argument syntax instead. See [Argument conventions](/mojo/manual/values/ownership#argument-conventions) in the Mojo Manual for more information.
* Initializers are now treated as static methods that return an instance of `Self`. This means the `out` argument of an initializer is now treated the same as any other function result or `out` argument. This is generally invisible, except that patterns like `instance.__init__()` and `x.__copyinit__(y)` no longer work. Simply replace them with `instance = T()` and `x = y` respectively.
* The [`@value`](/mojo/manual/decorators/value) decorator now additionally derives an implementation of the [`ExplicitlyCopyable`](/mojo/stdlib/builtin/value/ExplicitlyCopyable/) trait. This will ease the transition to explicit copyability requirements by default in the Mojo collection types.
* Indexing into a homogenous tuple now produces the consistent element type without needing a rebind:

```mojo
var x = (1, 2, 3, 3, 4)
var y: Int = x[idx]  # Just works!
```

* You can now overload positional arguments with a keyword-only argument, and keyword-only arguments with different names:

```mojo
struct OverloadedKwArgs:
    var val: Int

    fn __init__(out self, single: Int):
        self.val = single

    fn __init__(out self, *, double: Int):
        self.val = double * 2

    fn __init__(out self, *, triple: Int):
        self.val = triple * 3

fn main():
    OverloadedKwArgs(1)         # val=1
    OverloadedKwArgs(double=1)  # val=2
    OverloadedKwArgs(triple=2)  # val=6
```

This also works with indexing operations:

```mojo
struct OverloadedKwArgs:
    var vals: List[Int]

    fn __init__(out self):
        self.vals = List[Int](0, 1, 2)

    fn __getitem__(self, idx: Int) -> Int:
        return self.vals[idx]

    fn __getitem__(self, *, idx2: Int) -> Int:
        return self.vals[idx2 * 2]

    fn __setitem__(mut self, idx: Int, val: Int):
        self.vals[idx] = val

    fn __setitem__(mut self, val: Int, *, idx2: Int):
        self.vals[idx2 * 2] = val

fn main():
    var x = OverloadedKwArgs()
    print(x[1])       # 1
    print(x[idx2=1])  # 2

    x[1] = 42
    x[idx2=1] = 84

    print(x[1])       # 42
    print(x[idx2=1])  # 84
```

* The `__disable_del x` operation has been tightened up to treat all fields of `x` as consumed by the point of the deletion, so it should be used after all the subfields are transferred or otherwise consumed (for example, at the end of the function), not before uses of the fields.

### GPU programming {#25-1-gpu-programming}

* The new [`gpu` package](/mojo/stdlib/gpu/) provides low-level programming constructs for working with GPUs. The Mojo `gpu` APIs allow you to manually manage interaction between the CPU host and GPU device, manage memory between devices, synchronize threads, and more. Currently the best way to use these APIs is from inside a [MAX custom operation](/max/custom-ops/). The following code example shows a GPU kernel written in Mojo:

```mojo
from max.tensor import ManagedTensorSlice
from gpu import thread_idx, block_dim, block_idx

fn gpu_add_kernel(out: ManagedTensorSlice, x: ManagedTensorSlice[out.type, out.rank]):
    tid_x = thread_idx.x + block_dim.x * block_idx.x
    tid_y = thread_idx.y + block_dim.y * block_idx.y
    if tid_x < x.dim_size(0) and tid_y < x.dim_size(1):
        out[tid_x, tid_y] = x[tid_x, tid_y] + 1
```

### Standard library changes {#25-1-standard-library-changes}

* The [`ExplicitlyCopyable`](/mojo/stdlib/builtin/value/ExplicitlyCopyable/) trait has changed to require a `fn copy(self) -> Self` method. Previously, an initializer with the signature `fn __init__(out self, *, other: Self)` had been required by `ExplicitlyCopyable`. This improves the "greppability" and at-a-glance readability when a programmer is looking for places in their code that may be performing copies.
* The `IntLike` trait has been removed and its functionality incorporated into the [`Indexer`](/mojo/stdlib/builtin/int/Indexer/) trait. This enables `SIMD` scalar integer types and `UInt` to be used for indexing into all of the collection types, as well as optimizing away normalization checks for `UInt` indexing.
* The [`ImplicitlyIntable`](/mojo/stdlib/builtin/int/ImplicitlyIntable/) trait has been added, allowing types to be implicitly converted to an `Int` by implementing the `__as_int__()` method:

  ```mojo
  @value
  struct Foo(ImplicitlyIntable):
      var i: Int

      fn __as_int__(self) -> Int:
          return self.i
  ```

* You can now cast `SIMD` types using constructors:

  ```mojo
  var val = Int8(42)
  var cast = Int32(val)
  ```

  It also works when passing a scalar value to a larger vector size:

  ```mojo
  var vector = SIMD[DType.int64, 4](cast)  # [42, 42, 42, 42]
  ```

  For values other than scalars, the sizes of the source and destination `SIMD` vectors must be equal:

  ```mojo
  var float_vector = SIMD[DType.float64, 4](vector)
  ```

  [`SIMD.cast()`](/mojo/stdlib/builtin/simd/SIMD#cast) still exists to infer the size of the new vector:

  ```mojo
  var inferred_size = float_vector.cast[DType.uint64]()  # [42, 42, 42, 42]
  ```

* Added [`SIMD.from_bytes()`](/mojo/stdlib/builtin/simd/SIMD/#from_bytes) and [`SIMD.as_bytes()`](/mojo/stdlib/builtin/simd/SIMD/#as_bytes) to convert a list of bytes to a list of scalars and vice versa, accepting the endianness as an argument. These are similar to the Python `int.from_bytes()` and `int.to_bytes()` functions.

* You can now use [`max()`](/mojo/stdlib/builtin/math/max) and [`min()`](/mojo/stdlib/builtin/math/min) with a variadic number of arguments.

* `bit_ceil()` has been renamed to [`next_power_of_two()`](/mojo/stdlib/bit/bit/next_power_of_two), and `bit_floor()` to [`prev_power_of_two()`](/mojo/stdlib/bit/bit/prev_power_of_two). This is to improve readability and clarity in their use.

* Added a new boolean `validate` parameter to [`b64decode()`](/mojo/stdlib/base64/base64/b64decode).

* The [`b64encode()`](/mojo/stdlib/base64/base64/b64encode) overload that previously took a `List` has been changed to take a [`Span`](/mojo/stdlib/memory/span/Span/).

* Removed the `@implicit` decorator from some standard library initializer methods that perform allocation. This reduces places where Mojo code could implicitly allocate where the user may not be aware. Removed `@implicit` from:

  * `String.__init__(out self, StringSlice)`
  * `List.__init__(out self, owned *values: T)`
  * `List.__init__(out self, span: Span[T])`

* Added more aliases in [`sys.ffi`](/mojo/stdlib/sys/ffi/) to round out the usual needs for FFI bindings.

### Tooling changes {#25-1-tooling-changes}

* `mblack` (aka [`mojo format`](/mojo/cli/format)) no longer formats non-Mojo files. This prevents unexpected formatting of Python files.

* Full struct signature information is now exposed in the documentation generator, and in the symbol outline and hover markdown via the Mojo Language Server.

* The [`env_get_dtype()`](/mojo/stdlib/sys/param_env/env_get_dtype) function has been added to the [`sys.param_env`](/mojo/stdlib/sys/param_env/) module. This allows you to get the value of a `DType` from the param environment.

### ❌ Removed

* `StringRef` has been removed. Use [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice/) instead.

* Changed [`sys.argv()`](/mojo/stdlib/sys/arg/argv) to return a list of `StringSlice` values.

* Added an explicit [`Path()`](/mojo/stdlib/pathlib/path/Path/#__init__) constructor from `StringSlice`.

* The `Tuple.get[i, T]()` method has been removed. Please use `tup[i]` or `rebind[T](tup[i])` as needed instead.

* `StringableCollectionElement` is deprecated. Use [`WritableCollectionElement`](/mojo/stdlib/builtin/value/WritableCollectionElement/) instead, which still allows you to construct a `String`, but can avoid intermediate allocations.
* The `IntLike` trait has been removed and its functionality incorporated into the [`Indexer`](/mojo/stdlib/builtin/int/Indexer/) trait.

* The `Type{field1: 42, field2: 17}` syntax for directly initializing register-passable types has been removed. This was legacy syntax; to upgrade your code, add the [`@value`](/mojo/manual/decorators/value) decorator to your struct to get a fieldwise initializer and use `Type(field1=42, field2=17)` instead.

### 🛠️ Fixed

* The Mojo Kernel for Jupyter Notebooks is working again on nightly releases.
* The command `mojo debug --vscode` now sets the current working directory properly.
* [Issue #3796](https://github.com/modular/modular/issues/3796) - Compiler crash handling `for`-`else` statement.
* [Issue #3540](https://github.com/modular/modular/issues/3540) - Using named output slot breaks trait conformance.
* [Issue #3617](https://github.com/modular/modular/issues/3617) - Can't generate the constructors for a type wrapping `!lit.ref`.
* The Mojo Language Server no longer crashes on empty `__init__.mojo` files. [Issue #3826](https://github.com/modular/modular/issues/3826).
* [Issue #3935](https://github.com/modular/modular/issues/3935) - Confusing OOM error when using `Tuple.get()` incorrectly.
* [Issue #3955](https://github.com/modular/modular/issues/3955) - Unexpected copy behavior with `def` arguments in loops.
* [Issue #3960](https://github.com/modular/modular/issues/3960) - Infinite `for` loop.

## v24.6 (2024-12-17)

### ✨ Highlights

Here's a brief summary of some of the major changes in this release, with more detailed information in the following sections:

* The `inout` and `borrowed` argument conventions have been renamed to `mut` and `read`, respectively. A new `out` convention has been added for the `self` argument in constructors and for named results. See [Language changes](#24-6-language-changes) for details.

* `Lifetime` and related types in the standard library have been renamed to [`Origin`](/mojo/stdlib/builtin/type_aliases/Origin) to better clarify that parameters of this type indicate where a reference is derived from, not the more complicated notion of where a variable is initialized and destroyed. As a consequence, the `__lifetime_of()` operator is now named `__origin_of()`.

  There are also a number of other origin-related improvements in this release, including being able to specify a union of origins by listing multiple values in the `__origin_of()` operator or inside the `ref` origin specifier (`ref [a, b]`). For details, see [Language changes](#24-6-language-changes). For background information and rationale on the name change, see [the proposal](https://github.com/modular/modular/issues/3623). For more information on origins, see [Lifetimes, origins and references](/mojo/manual/values/lifetimes) in the Mojo Manual.

* Implicit conversions are now opt-in using the [`@implicit`](/mojo/manual/decorators/implicit) decorator. See [Language changes](#24-6-language-changes) for details.

* The standard library has added several new types, including [`Deque`](/mojo/stdlib/collections/deque/Deque) (a double-ended queue) and [`OwnedPointer`](/mojo/stdlib/memory/owned_pointer/OwnedPointer) (a safe, single-owner, non-nullable smart pointer). See [Standard library changes](#24-6-standard-library-changes) for details.

* The VS Code extension now supports setting data breakpoints and function breakpoints, and the Mojo LLDB debugger supports symbol breakpoints, such as `b main` or `b my_module::main`.
* We've made a number of improvements to how information is displayed in error messages, LSP, and generated API documentation. For details, see [Tooling changes](#24-6-tooling-changes).

* And we've added a number of new docs, including a brand new [Mojo tutorial](/mojo/manual/get-started), new pages on [operators and expressions](/mojo/manual/operators), [error handling](/mojo/manual/errors), and [pointers](/mojo/manual/pointers/), and many smaller additions and improvements.

### Language changes {#24-6-language-changes}

* Argument convention changes:

  * The `inout` and `borrowed` argument conventions have been renamed to `mut` (for "mutate") and `read`, respectively. These verbs reflect what the callee can do to the argument value passed in by the caller, without requiring the programmer to know about advanced features like references. For information on Mojo's argument conventions, see [Argument conventions](/mojo/manual/values/ownership/#argument-conventions) in the Mojo Manual.

  * The argument convention for the `self` argument in the `__init__()`, `__copyinit__()`, and `__moveinit__()` methods has been changed from `inout` to `out`, reflecting that a constructor method initializes its `self` value without reading from it. This also enables spelling the type of an initializer correctly, which was not supported before:

    ```mojo
    struct Foo:
        fn __init__(out self): pass

    fn test():
        # This works now
        var fnPtr: fn(out x: Foo) -> None = Foo.__init__

        var someFoo: Foo
        fnPtr(someFoo)  # initializes someFoo
    ```

    The previous `fn __init__(inout self)` syntax is still supported in this release of Mojo, but will be removed in the future. Please migrate to the new syntax.

  * Similarly, the spelling of named results has switched to use `out` syntax instead of `-> T as name`. Functions may have at most one named result or return type specified with the usual `->` syntax. `out` arguments may occur anywhere in the argument list, but are typically last (except for `__init__` methods, where they are typically first).

    ```mojo
    # This function has type "fn() -> String"
    fn example(out result: String):
        result = "foo"
    ```

    The parser still accepts the old syntax as a synonym for this, but that will eventually be deprecated and removed. This was [discussed extensively in a public proposal](https://github.com/modular/modular/issues/3623). For more information, see [Named results](/nightly/mojo/manual/functions#named-results) in the Mojo Manual.

* Single argument constructors now require the [`@implicit`](/mojo/manual/decorators/implicit) decorator to allow for implicit conversions. Previously you could define an `__init__` that takes a single argument:

  ```mojo
  struct Foo:
      var value: Int

      fn __init__(out self, value: Int):
          self.value = value
  ```

  And this would allow you to pass an `Int` in the position of a `Foo`:

  ```mojo
  fn func(foo: Foo):
      print("implicitly converted Int to Foo:", foo.value)

  fn main():
      func(Int(42))
  ```

  This can result in complicated errors that are difficult to debug. By default this implicit behavior is now turned off, so you have to explicitly construct `Foo`:

  ```mojo
  fn main():
      func(Foo(42))
  ```

  You can still opt into implicit conversions by adding the `@implicit` decorator. For example, to enable implicit conversions from `Int` to `Foo`:

  ```mojo
  struct Foo:
      var value: Int

      @implicit
      fn __init__(out self, value: Int):
          self.value = value
  ```

  For more information, see [Constructors and implicit conversion](/mojo/manual/lifecycle/life#constructors-and-implicit-conversion) in the Mojo Manual.
* Origin-related changes:

  * The `AnyLifetime` type (useful for declaring origin types as parameters) has been renamed to [`Origin`](/mojo/stdlib/builtin/type_aliases/Origin) and the `__lifetime_of()` operator renamed to `__origin_of()`.

  * `Origin` is now a complete wrapper around the MLIR origin type.

    * The `Origin.type` alias has been renamed to `_mlir_origin`. In parameter lists, you can now write just `Origin[..]`, instead of `Origin[..].type`.

    * `ImmutableOrigin` and `MutableOrigin` are now, respectively, just aliases for `Origin[False]` and `Origin[True]`.

    * `Origin` struct values are now supported in the origin specifier of a `ref [..]` argument.

    * Added `Origin.cast_from` for casting the mutability of an origin value.

  * `ref` arguments and results now allow for providing a memory value directly in the origin specifier, rather than requiring the use of `__origin_of()`. It is still fine to use `__origin_of()` explicitly though, and this is required when specifying origins for parameters (e.g. to the `Pointer` type). For example, this is now valid without `__origin_of()`:

    ```mojo
    fn return_ref(a: String) -> ref [a] String:
        return a
    ```

  * Various improvements to origin handling and syntax have landed, including support for the ternary operator and allowing multiple arguments in a `ref` specifier (which are implicitly unions). This enables clean expression of simple algorithms:

    ```mojo
    fn my_min[T: Comparable](ref a: T, ref b: T) -> ref [a, b] T:
        return a if a < b else b
    ```

  * The `__type_of(x)` and `__origin_of(x)` operators are much more general now: they allow arbitrary expressions inside of them, allow referring to dynamic values in parameter contexts, and even allow referring to raising functions in non-raising contexts. These operations never evaluate their expression, so any side effects that occur in the expression are never evaluated at runtime, eliminating concerns about `__type_of(expensive())` being a problem.

  * The destructor insertion logic in Mojo is now aware that types that take a `MutableAnyOrigin` or `ImmutableAnyOrigin` as part of their signature could potentially access any live value that destructor insertion is tracking, eliminating a significant usability issue with unsafe APIs like [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer). Consider a typical example working with strings before this change:

    ```mojo
    var str = String(...)
    var ptr = str.unsafe_ptr()
    some_low_level_api(ptr)
    _ = str^  # OLD HACK: Explicitly keep string alive until here!
    ```

    The `_ = str^` pattern was formerly required because the Mojo compiler has no idea what "ptr" might reference. As a consequence, it had no idea that `some_low_level_api()` might access `str` and therefore thought it was OK to destroy the `String` before the call - this is why the explicit lifetime extension was required.

    Mojo now knows that [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) may access the `MutableAnyOrigin` origin, and now assumes that any API that uses that origin could use live values. In this case, it assumes that `some_low_level_api()` might access `str`, and because it might be using it, it cannot destroy `str` until after the call. The consequence of this is that the old hack is no longer needed for these cases!

* Function types now accept an origin set parameter. This parameter represents the origins of values captured by a parameter closure. The compiler automatically tags parameter closures with the right set of origins.
  This enables lifetimes and parameter closures to correctly compose.

  ```mojo
  fn call_it[f: fn() capturing [_] -> None]():
      f()

  fn test():
      var msg = String("hello world")

      @parameter
      fn say_hi():
          print(msg)

      call_it[say_hi]()
      # no longer need to write `_ = msg^`!!
  ```

  Note that this only works for higher-order functions which have explicitly added `[_]` as the capture origins. By default, the compiler still assumes a `capturing` closure does not reference any origins. This will soon change.

* Infer-only parameters may now be explicitly bound with keywords, enabling some important patterns in the standard library:

  ```mojo
  struct StringSlice[is_mutable: Bool, //, origin: Origin[is_mutable]]: ...

  alias ImmStringSlice = StringSlice[is_mutable=False]

  # This auto-parameterizes on the origin, but constrains it to being an
  # immutable slice instead of a potentially mutable one.
  fn take_imm_slice(a: ImmStringSlice): ...
  ```

* The flag for turning on asserts has changed. For example, to enable all checks:

  ```bash
  mojo -D ASSERT=all main.mojo
  ```

  The levels are:

  * `none`: all assertions off
  * `warn`: print assertion errors e.g. for multithreaded tests (previously `-D ASSERT_WARNING`)
  * `safe`: the default mode for standard CPU safety assertions
  * `all`: turn on all assertions (previously `-D MOJO_ENABLE_ASSERTIONS`)

  You can now also pass `Stringable` arguments to format a message, which will have no runtime penalty or IR bloat cost when assertions are off. Previously you had to write:

  ```mojo
  x = -1
  debug_assert(
      x > 0, String.format_sequence("expected x to be more than 0 but got: ", x)
  )
  ```

  This can't be optimized away by the compiler in release builds. Now you can pass multiple arguments for a formatted message at no runtime cost:

  ```mojo
  debug_assert(x > 0, "expected x to be more than 0 but got: ", x)
  ```

* Automatic parameterization of parameters is now supported. Specifying a parameterized type with unbound parameters causes them to be implicitly added to the function signature as infer-only parameters.

  ```mojo
  fn foo[value: SIMD[DType.int32, _]]():
      pass

  # Equivalent to
  fn foo[size: Int, //, value: SIMD[DType.int32, size]]():
      pass
  ```

* Mojo can now interpret simple LLVM intrinsics in parameter expressions, enabling things like `count_leading_zeros` to work at compile time: [Issue #933](https://github.com/modular/modular/issues/933).

* Introduced the `@explicit_destroy` annotation, the `__disable_del` keyword, the `UnknownDestructibility` trait, and the `ImplicitlyDestructible` keyword, for the experimental explicitly destroyed types feature.

* Added associated types; we can now have aliases like `alias T: AnyType`, `alias N: Int`, etc. in a trait, and then specify them in structs that conform to that trait. For more information, see [Associated aliases for generics](/mojo/manual/traits#associated-aliases-for-generics).

### Standard library changes {#24-6-standard-library-changes}

* Introduced a new [`Deque`](/mojo/stdlib/collections/deque/Deque) (double-ended queue) collection type, based on a dynamically resizing circular buffer for efficient O(1) additions and removals at both ends as well as O(1) direct access to all elements.

  The `Deque` supports the full Python `collections.deque` API, ensuring that all expected deque operations perform as in Python.

  Enhancements to the standard Python API include `peek()` and `peekleft()` methods for non-destructive access to the last and first elements, and advanced constructor options (`capacity`, `min_capacity`, and `shrink`) for customizing memory allocation and performance.
  These options allow for optimized memory usage and reduced buffer reallocations, providing flexibility based on application requirements.

* The `Formatter` struct has been replaced with a [`Writer`](/mojo/stdlib/utils/write/Writer) trait to enable buffered IO, increasing print and file-writing performance to the same speed as C. It's now more general purpose and can write any `Span[Byte]`. To align with this, the `Formattable` trait is now named [`Writable`](/mojo/stdlib/utils/write/Writable), and the `String.format_sequence()` static method used to initialize a new `String` has been renamed to [`String.write()`](/mojo/stdlib/collections/string/string/String/#write). Here's an example of using all of the changes:

  ```mojo
  from memory import Span

  @value
  struct NewString(Writer, Writable):
      var s: String

      # Writer requirement to write a Span of Bytes
      fn write_bytes(inout self, bytes: Span[Byte, _]):
          self.s._iadd[False](bytes)

      # Writer requirement to take multiple args
      fn write[*Ts: Writable](inout self, *args: *Ts):
          @parameter
          fn write_arg[T: Writable](arg: T):
              arg.write_to(self)

          args.each[write_arg]()

      # Also make it Writable to allow `print` to write the inner String
      fn write_to[W: Writer](self, inout writer: W):
          writer.write(self.s)

  @value
  struct Point(Writable):
      var x: Int
      var y: Int

      # Pass multiple args to the Writer. The Int and StringLiteral types call
      # `writer.write_bytes` in their own `write_to` implementations.
      fn write_to[W: Writer](self, inout writer: W):
          writer.write("Point(", self.x, ", ", self.y, ")")

      # Enable conversion to a String using `str(point)`
      fn __str__(self) -> String:
          return String.write(self)

  fn main():
      var point = Point(1, 2)
      var new_string = NewString(str(point))
      new_string.write("\n", Point(3, 4))
      print(new_string)
  ```

  ```output
  Point(1, 2)
  Point(3, 4)
  ```

* Python interop changes:

  * Introduced [`TypedPythonObject`](/mojo/stdlib/python/python_object/TypedPythonObject) as a light-weight way to annotate [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) values with static type information. This design will likely evolve and change significantly.

    * Added `TypedPythonObject["Tuple"].__getitem__()` for accessing the elements of a Python tuple.

  * Added [`Python.add_object()`](/mojo/stdlib/python/python/Python#add_object), to add a named `PythonObject` value to a Python 'module' object instance.

  * Added [`Python.unsafe_get_python_exception()`](/mojo/stdlib/python/python/Python#unsafe_get_python_exception), as an efficient low-level utility to get the Mojo `Error` equivalent of the current CPython error state.

  * Added [`PythonObject.from_borrowed_ptr()`](/mojo/stdlib/python/python_object/PythonObject#from_borrowed_ptr), to simplify the construction of `PythonObject` values from CPython 'borrowed reference' pointers. The existing `PythonObject.__init__(PyObjectPtr)` should continue to be used for the more common case of constructing a `PythonObject` from a 'strong reference' pointer.

  * Support for multi-dimensional indexing and slicing for `PythonObject` (PR [#3549](https://github.com/modular/modular/pull/3549), PR [#3583](https://github.com/modular/modular/pull/3583)).

    ```mojo
    var np = Python.import_module("numpy")
    var a = np.array(PythonObject([1, 2, 3, 4, 5, 6])).reshape(2, 3)
    print((a[0, 1]))     # 2
    print((a[1][::-1]))  # [6 5 4]
    ```

    Note that the syntax `a[1, ::-1]` is currently not supported.

  * Added [`PythonObject.__contains__()`](/mojo/stdlib/python/python_object/PythonObject#__contains__).
    ([PR #3101](https://github.com/modular/modular/pull/3101)) Example usage:

    ```mojo
    x = PythonObject([1, 2, 3])
    if 1 in x:
        print("1 in x")
    ```

* Pointer-related changes:

  * The [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) type now has an `origin` parameter that can be used when the `UnsafePointer` points to a value with a known origin. This origin is propagated through the `ptr[]` indirection operation. This parameter and other `UnsafePointer` parameters (other than the type) are now keyword-only.

  * You can now index into `UnsafePointer` using `SIMD` scalar integral types:

    ```mojo
    p = UnsafePointer[Int].alloc(1)
    i = UInt8(1)
    p[i] = 42
    print(p[i])
    ```

  * Added a new [`OwnedPointer`](/mojo/stdlib/memory/owned_pointer/OwnedPointer) type as a safe, single-owner, non-nullable smart pointer with similar semantics to Rust's [`Box`](https://doc.rust-lang.org/std/boxed/struct.Box.html) and C++'s [`std::unique_ptr`](https://en.cppreference.com/w/cpp/memory/unique_ptr). ([PR #3524](https://github.com/modular/modular/pull/3524))

  * `Arc` has been renamed to [`ArcPointer`](/mojo/stdlib/memory/arc/ArcPointer), for consistency with `OwnedPointer`.

  * [`ArcPointer`](/mojo/stdlib/memory/arc/ArcPointer) now implements [`Identifiable`](/mojo/stdlib/builtin/identifiable/Identifiable), and can be compared for pointer equivalence using `a is b`.

  * The `Reference` type has been renamed to [`Pointer`](/mojo/stdlib/memory/pointer/Pointer): a memory-safe complement to `UnsafePointer`. This change is motivated by the fact that `Pointer` is assignable and requires an explicit dereference with `ptr[]`. Renaming to `Pointer` clarifies that "references" means `ref` arguments and results, and gives us a model that is more similar to what the C++ community would expect. For an overview of Mojo's pointer types, see the new [Intro to pointers](/mojo/manual/pointers/) page in the Mojo Manual.

  * A new [`as_noalias_ptr()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#as_noalias_ptr) method has been added to `UnsafePointer`. This method specifies to the compiler that the resultant pointer is a distinct identifiable object that does not alias any other memory in the local scope.

* Added the [`Floatable`](/mojo/stdlib/builtin/floatable/Floatable) and [`FloatableRaising`](/mojo/stdlib/builtin/floatable/FloatableRaising) traits to denote types that can be converted to a `Float64` value using the builtin `float` function. Made `SIMD` and `FloatLiteral` conform to the `Floatable` trait. ([PR #3163](https://github.com/modular/modular/pull/3163))

  ```mojo
  fn foo[F: Floatable](v: F): ...

  var f = float(Int32(45))
  ```

* The [`rebind()`](/mojo/stdlib/builtin/rebind/rebind) standard library function now works with memory-only types in addition to `@register_passable("trivial")` ones, without requiring a copy. For more information, see [The `rebind()` builtin](/mojo/manual/parameters/#the-rebind-builtin) in the Mojo Manual.

* Introduced the [`random.shuffle()`](/mojo/stdlib/random/random/shuffle) function for randomizing the elements of a `List`. ([PR #3327](https://github.com/modular/modular/pull/3327))

  Example:

  ```mojo
  from random import shuffle

  var l = List[Int](1, 2, 3, 4, 5)
  shuffle(l)
  ```

* The [`Dict.__getitem__()`](/mojo/stdlib/collections/dict/Dict#__getitem__) method now returns a reference instead of a copy of the value (or raises). This improves the performance of common code that uses `Dict` by allowing borrows from the `Dict` elements.
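  As a minimal sketch (assuming the 24.6 API; the key and values are arbitrary) of what returning a reference enables, an entry can now be updated in place through normal subscript syntax without copying the value out:

  ```mojo
  from collections import Dict

  fn main() raises:
      var counts = Dict[String, Int]()
      counts["mojo"] = 1
      # `__getitem__()` returns a reference (or raises if the key is
      # missing), so this increments the stored value in place.
      counts["mojo"] += 1
      print(counts["mojo"])  # 2
  ```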
* [`Slice.step`](/mojo/stdlib/builtin/builtin_slice/Slice#fields) is now an `Optional[Int]`, matching the optionality of `slice.step` in Python. ([PR #3160](https://github.com/modular/modular/pull/3160))

* There is now a [`Byte`](/mojo/stdlib/builtin/simd/#aliases) alias to better express intent when working with a pack of bits. ([PR #3670](https://github.com/modular/modular/pull/3670))

* Expanded [`os.path`](/mojo/stdlib/os/path/path/) with new functions:

  * `os.path.expandvars()`: Expands environment variables in a path. ([PR #3735](https://github.com/modular/modular/pull/3735))
  * `os.path.splitroot()`: Splits a path into drive, root, and tail. ([PR #3780](https://github.com/modular/modular/pull/3780))

* Added a [`reserve()`](/mojo/stdlib/collections/string/string/String#reserve) method and a new constructor to the `String` struct to allocate additional capacity. ([PR #3755](https://github.com/modular/modular/pull/3755))

* A new [`StringLiteral.get[some_stringable]()`](/mojo/stdlib/builtin/string_literal/StringLiteral#get) method is available. It allows forming a runtime-constant `StringLiteral` from a compile-time-dynamic `Stringable` value.

* [`Span`](/mojo/stdlib/memory/span/Span) has moved from the `utils` module to the `memory` module.

* [`Span`](/mojo/stdlib/memory/span/Span) now implements `__reversed__()`. This means that one can get a reverse iterator over a `Span` using `reversed(my_span)`. Users should currently prefer this method over `my_span[::-1]`.

* A new [`AsBytes`](/mojo/stdlib/memory/span/AsBytes) trait has been added to enable taking a `Span[Byte]` from any type that implements `as_bytes()`. `String.as_bytes()` and `String.as_bytes_slice()` have been consolidated under `String.as_bytes()` to return a `Span[Byte]`. If you require a copy, you can convert the `Span` to a `List` with `List(my_string.as_bytes())`.

* [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice) now implements `strip()`, `rstrip()`, and `lstrip()`.

* [`StringRef`](/mojo/stdlib/collections/string/string_slice/StringSlice) now implements `split()`, which can be used to split a `StringRef` into a `List[StringRef]` by a delimiter. ([PR #2705](https://github.com/modular/modular/pull/2705))

* [`StringRef`](/mojo/stdlib/collections/string/string_slice/StringSlice) is now representable, so `repr(StringRef("hello"))` will return `StringRef('hello')`.

* More things have been removed from the auto-exported set of entities in the `prelude` module from the Mojo standard library:

  * `UnsafePointer` has been removed. Please explicitly import it via `from memory import UnsafePointer`.
  * `StringRef` has been removed. Please explicitly import it via `from utils import StringRef`.

* Restored implicit copyability of [`Tuple`](/mojo/stdlib/builtin/tuple/Tuple) and [`ListLiteral`](/mojo/stdlib/builtin/list_literal/ListLiteral).

* The [aliases for C foreign function interface (FFI)](/mojo/stdlib/sys/ffi/#aliases) have been renamed: `C_int` -> `c_int`, `C_long` -> `c_long`, and so on.
* `Float32` and `Float64` are now printed and converted to strings with a roundtrip guarantee and shortest representation:

  ```plaintext
  Value                    Old                      New
  Float64(0.3)             0.29999999999999999      0.3
  Float32(0.3)             0.30000001192092896      0.3
  Float64(0.0001)          0.0001                   0.0001
  Float32(0.0001)          9.9999997473787516e-05   0.0001
  Float64(-0.00001)        -1.0000000000000001e-05  -1e-05
  Float32(-0.00001)        -9.9999997473787516e-06  -1e-05
  Float32(0.00001234)      1.2339999557298142e-05   1.234e-05
  Float32(-0.00000123456)  -1.2345600453045336e-06  -1.23456e-06
  Float64(1.1234567e-320)  1.1235052786429946e-320  1.1235e-320
  Float64(1.234 * 10**16)  12340000000000000.0      1.234e+16
  ```

* The `StaticIntTuple` data structure in the `utils` package has been renamed to [`IndexList`](/mojo/stdlib/utils/index_/IndexList). The data structure now allows one to specify the index bitwidth of the elements along with whether the underlying indices are signed or unsigned.

* Added [`DLHandle.get_symbol()`](/mojo/stdlib/sys/ffi/DLHandle#get_symbol), for getting a pointer to a symbol in a dynamic library. This is more general purpose than the existing methods for getting function pointers.

### Tooling changes {#24-6-tooling-changes}

* The VS Code Mojo Debugger now has a `buildArgs` JSON debug configuration setting that can be used in conjunction with `mojoFile` to define the build arguments when compiling the Mojo file.

* The VS Code extension now supports a `Configure Build and Run Args` command that helps set the build and run arguments for the `Run Mojo File` and `Debug Mojo File` actions. A corresponding button appears in the `Run and Debug` selector in the top right corner of a Mojo file.

* The VS Code extension now has the `mojo.run.focusOnTerminalAfterLaunch` setting, which controls whether to focus on the terminal used by the `Mojo: Run Mojo File` command or on the editor after launch. [Issue #3532](https://github.com/modular/modular/issues/3532).

* The VS Code extension now has the `mojo.SDK.additionalSDKs` setting, which allows the user to provide a list of MAX SDKs that the extension can use when determining a default SDK to use. The user can select the default SDK to use with the `Mojo: Select the default MAX SDK` command.

* The VS Code extension now supports setting [data breakpoints](https://code.visualstudio.com/docs/editor/debugging#_data-breakpoints) as well as [function breakpoints](https://code.visualstudio.com/docs/editor/debugging#_function-breakpoints).

* The Mojo LLDB debugger now supports symbol breakpoints, for example, `b main` or `b my_module::main`.

* Error messages that include type names no longer include inferred or defaulted parameters when they aren't needed. For example, previously Mojo complained about things like:

  ```plaintext
  ... cannot be converted from 'UnsafePointer[UInt, 0, _default_alignment[::AnyType](), MutableAnyOrigin]' to 'UnsafePointer[Int, 0, _default_alignment[::AnyType](), MutableAnyOrigin]'
  ```

  It now complains more helpfully that:

  ```plaintext
  ... cannot be converted from 'UnsafePointer[UInt]' to 'UnsafePointer[Int]'
  ```

* Tooling now prints the origins of `ref` arguments and results correctly, and prints `self` instead of `self: Self` in methods.

* The Mojo Language Server and generated documentation now print parametric result types correctly, e.g. showing `SIMD[type, simd_width]` instead of `SIMD[$0, $1]`.

* Generated API documentation now shows the signatures for structs, and identifies `@register_passable` and `@register_passable("trivial")` types.
* The VS Code extension now allows cancelling the installation of its private MAX SDK.

* The VS Code extension now opens the Run and Debug tab automatically whenever a debug session starts.

* The `mojo debug --vscode` command now supports the `--init-command` and `--stop-on-entry` flags. Execute `mojo debug --help` for more information.

* The Mojo LLDB debugger on VS Code now supports inspecting the raw attributes of variables that are handled as synthetic types, e.g. `List` from Mojo or `std::vector` from C++.

* The VS Code extension now allows selecting a default SDK when multiple are available.

### ❌ Removed

* The `UnsafePointer.bitcast()` overload for `DType` has been removed. Wrap your `DType` in a `Scalar[my_dtype]` to call the only remaining overload of `bitcast()`.

### 🛠️ Fixed

* Lifetime tracking is now fully field sensitive, which makes the uninitialized variable checker more precise.
* [Issue #1310](https://github.com/modular/modular/issues/1310) - Mojo permits the use of any constructor for implicit conversions.
* [Issue #1632](https://github.com/modular/modular/issues/1632) - Mojo produces weird error when inout function is used in non mutating function.
* [Issue #3444](https://github.com/modular/modular/issues/3444) - Raising init causing use of uninitialized variable.
* [Issue #3544](https://github.com/modular/modular/issues/3544) - Known mutable `ref` arguments are not optimized as `noalias` by LLVM.
* [Issue #3559](https://github.com/modular/modular/issues/3559) - VariadicPack doesn't extend the lifetimes of the values it references.
* [Issue #3627](https://github.com/modular/modular/issues/3627) - Compiler overlooked exclusivity violation caused by `ref [MutableAnyOrigin] T`.
* [Issue #3710](https://github.com/modular/modular/issues/3710) - Mojo frees memory while a reference to it is still in use.
* [Issue #3805](https://github.com/modular/modular/issues/3805) - Crash when initializing `!llvm.ptr`.
* [Issue #3816](https://github.com/modular/modular/issues/3816) - Ternary if-operator doesn't propagate origin information.
* [Issue #3815](https://github.com/modular/modular/issues/3815) - [BUG] Mutability not preserved when taking the union of two origins.
* [Issue #3829](https://github.com/modular/modular/issues/3829) - Poor error message when invoking a function pointer upon an argument of the wrong origin.
* [Issue #3830](https://github.com/modular/modular/issues/3830) - Failures emitting register RValues to `ref` arguments.
* The VS Code extension now auto-updates its private copy of the MAX SDK.
* The variadic initializer for `SIMD` now works in parameter expressions.
* The VS Code extension now downloads its private copy of the MAX SDK in a way that prevents `ETXTBSY` errors on Linux.
* The VS Code extension now allows invoking a Mojo formatter from SDK installations that contain whitespace in their paths.

### Special thanks

Special thanks to our community contributors: [@soraros](https://github.com/soraros), [@jjvraw](https://github.com/jjvraw), [@bgreni](https://github.com/bgreni), [@thatstoasty](https://github.com/thatstoasty), [@szbergeron](https://github.com/szbergeron), [@rd4com](https://github.com/rd4com), [@fknfilewalker](https://github.com/fknfilewalker), [@gabrieldemarmiesse](https://github.com/gabrieldemarmiesse), [@avitkauskas](https://github.com/avitkauskas), and [@martinvuyk](https://github.com/martinvuyk).
## v24.5 (2024-09-13)

### ✨ Highlights

Here's a brief summary of some of the major changes in this release, with more detailed information in the following sections:

* Mojo now supports Python 3.12 interoperability.

* The set of automatically imported entities (types, aliases, functions) into users' Mojo programs has been dramatically reduced. This can break existing user code, as users will need to explicitly import entities that were previously included automatically.

* [`print()`](/mojo/stdlib/builtin/io/print) now requires that its arguments conform to the [`Formattable`](/mojo/stdlib/utils/write/Writable) trait. This enables efficient stream-based writing by default, avoiding unnecessary intermediate String heap allocations.

* The new builtin [`input()`](/mojo/stdlib/builtin/io/input) function prints an optional prompt and reads a line from standard input, in the same way as Python.

* Mojo now allows implicit definitions of variables within a `fn` in the same way that has been allowed in a `def`. The `var` keyword is still allowed, but is now optional.

* Mojo now diagnoses "argument exclusivity" violations due to aliasing references. Mojo requires references (including implicit references due to `borrowed`/`inout` arguments) to be uniquely referenced (non-aliased) if mutable. This is a warning in the 24.5 release, but will be upgraded to an error in subsequent releases.

* Mojo now supports "conditional conformances" where some methods on a struct have additional trait requirements that the struct itself doesn't.

* `DTypePointer`, `LegacyPointer`, and `Pointer` have been removed. Use [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) instead. Functions that previously took a `DTypePointer` now take an equivalent `UnsafePointer`. For more information on using pointers, see [Unsafe pointers](/mojo/manual/pointers/unsafe-pointers) in the Mojo Manual.

* There are many new standard library APIs, with new features for strings, collections, and interacting with the filesystem and environment. Changes are listed in the standard library section.

* The VS Code extension now supports a vendored MAX SDK for VS Code, which is automatically downloaded by the extension and used for all Mojo features, including the Mojo Language Server, the Mojo debugger, the Mojo formatter, and more.

* [`mojo test`](/mojo/cli/test) now uses the Mojo compiler for running unit tests. This will resolve compilation issues that sometimes appeared, and will also improve overall test execution times.

### Language changes

* Mojo now allows implicit definitions of variables within a `fn` in the same way that has been allowed in a `def`. The `var` keyword is still allowed and still denotes the declaration of a new variable with a scope (in both `def` and `fn`). Relaxing this makes `fn` and `def` more similar, but they still differ in other important ways.

* Mojo now diagnoses "argument exclusivity" violations due to aliasing references. Mojo requires references (including implicit references due to `borrowed`/`inout` arguments) to be uniquely referenced (non-aliased) if mutable. This is important for code safety, because it allows the compiler (and readers of code) to understand where and when a value is mutated. It is also useful for performance optimization because it allows the compiler to know that accesses through immutable references cannot change behind the scenes.
  Here is an invalid example:

  ```mojo
  fn take_two_strings(a: String, inout b: String):
      # Mojo knows 'a' and 'b' cannot be the same string.
      b += a

  fn invalid_access():
      var my_string = String()

      # warning: passing `my_string` inout is invalid since it is also passed
      # borrowed.
      take_two_strings(my_string, my_string)
  ```

  This is similar to [Swift exclusivity checking](https://swift.org/blog/swift-5-exclusivity/) and to borrowing in the [Rust language](https://doc.rust-lang.org/beta/book/ch04-02-references-and-borrowing.html), sometimes known as "aliasing xor mutability". That said, the Mojo implementation details are somewhat different because lifetimes are embedded in types.

  This is a warning in the 24.5 release, but will be upgraded to an error in subsequent releases.

  :::note

  Argument exclusivity is not enforced for register-passable types. They are passed by copy, so they don't form aliases.

  :::

* Mojo now supports "conditional conformances" where some methods on a struct have additional trait requirements that the struct itself doesn't. This is expressed through an explicitly declared `self` type:

  ```mojo
  struct GenericThing[Type: AnyType]:  # Works with anything
      # Sugar for 'fn normal_method[Type: AnyType](self: GenericThing[Type]):'
      fn normal_method(self): ...

      # Just redeclare the requirements with more specific types:
      fn needs_move[Type: Movable](self: GenericThing[Type], owned val: Type):
          var tmp = val^  # Ok to move 'val' since it is Movable
          ...

  fn usage_example():
      var a = GenericThing[Int]()
      a.normal_method()  # Ok, Int conforms to AnyType
      a.needs_move(42)   # Ok, Int is movable

      var b = GenericThing[NonMovable]()
      b.normal_method()  # Ok, NonMovable conforms to AnyType

      # error: argument type 'NonMovable' does not conform to trait 'Movable'
      b.needs_move(NonMovable())
  ```

  Conditional conformance works with dunder methods and other things as well.

* As a specific form of "conditional conformances", initializers in a struct may indicate specific parameter bindings to use in the type of their `self` argument. For example:

  ```mojo
  @value
  struct MyStruct[size: Int]:
      fn __init__(inout self: MyStruct[0]): pass
      fn __init__(inout self: MyStruct[1], a: Int): pass
      fn __init__(inout self: MyStruct[2], a: Int, b: Int): pass

  def test(x: Int):
      a = MyStruct()      # Infers size=0 from 'self' type.
      b = MyStruct(x)     # Infers size=1 from 'self' type.
      c = MyStruct(x, x)  # Infers size=2 from 'self' type.
  ```

* Mojo now supports named result bindings. Named result bindings are useful for directly emplacing function results into the output slot of a function. This feature provides more flexibility and guarantees around emplacing the result of a function compared to "guaranteed" named return value optimization (NRVO). If a `@register_passable` result is bound to a name, the result value is made accessible as a mutable reference.

  ```mojo
  fn efficiently_return_string(b: Bool) -> String as output:
      if b:
          output = "emplaced!"
          mutate(output)
          return
      return "regular return"
  ```

  If we used a temporary for `output` instead, we would need to move into the result slot, which wouldn't work if the result type was non-movable.

  In a function with a named result, `return` may be used with no operand to signal an exit from the function, or it can be used normally to specify the return value of the function. The compiler will error if the result is not initialized on all normal exit paths from the function.

* `__setitem__()` now works with variadic argument lists such as:

  ```mojo
  struct YourType:
      fn __setitem__(inout self, *indices: Int, val: Int): ...
  ```
  The Mojo compiler now always passes the "new value" being set using the last keyword argument of `__setitem__()`, e.g. turning `yourType[1, 2] = 3` into `yourType.__setitem__(1, 2, val=3)`. This fixes [Issue #248](https://github.com/modular/modular/issues/248).

* Mojo context managers used in regions of code that may raise no longer need to define a "conditional" exit function in the form of `fn __exit__(self, e: Error) -> Bool`. This function allows the context manager to conditionally intercept and handle the error and allow the function to continue executing. This is useful for some applications, but in many cases the conditional exit would delegate to the unconditional exit function `fn __exit__(self)`.

  Concretely, this enables defining `with` regions that unconditionally propagate inner errors, allowing code like:

  ```mojo
  def might_raise() -> Int:
      ...

  def foo() -> Int:
      with ContextMgr():
          return might_raise()
      # no longer complains about missing return

  def bar():
      var x: Int
      with ContextMgr():
          x = might_raise()
      print(x)  # no longer complains about 'x' being uninitialized
  ```

* `async` functions now support memory-only results (like `String`, `List`, etc.) and `raises`. Accordingly, both [`Coroutine`](/mojo/stdlib/builtin/coroutine/Coroutine) and [`RaisingCoroutine`](/mojo/stdlib/builtin/coroutine/RaisingCoroutine) have been changed to accept `AnyType` instead of `AnyTrivialRegType`. This means the result types of `async` functions do not need to be `Movable`.

  ```mojo
  async fn raise_or_string(c: Bool) raises -> String:
      if c:
          raise "whoops!"
      return "hello world!"
  ```

  Note that `async` functions do not yet support indirect calls, `ref` results, and constructors.

* The [`Reference`](/mojo/stdlib/memory/pointer/Pointer) type (and many iterators) now use [infer-only parameters](/mojo/manual/parameters/#infer-only-parameters) to represent the mutability of their lifetime, simplifying the interface.

* The environment variable `MOJO_PYTHON` can be pointed to an executable to pin Mojo to a specific version:

  ```sh
  export MOJO_PYTHON="/usr/bin/python3.11"
  ```

  Or to a virtual environment to always have access to those Python modules:

  ```sh
  export MOJO_PYTHON="~/venv/bin/python"
  ```

  `MOJO_PYTHON_LIBRARY` still exists for environments with a dynamic `libpython` but no Python executable.

* The pointer aliasing semantics of Mojo have changed. Initially, Mojo adopted a C-like set of semantics around pointer aliasing and derivation. However, the C semantics bring a lot of history and baggage that are not needed in Mojo and which complicate compiler optimizations. The language overall provides a stronger set of invariants around pointer aliasing with lifetimes and exclusive mutable references to values, etc.

  It is now forbidden to convert a non-pointer-typed value derived from a Mojo-allocated pointer, such as an integer address, to a pointer-typed value. "Derived" means there is overlap in the bits of the non-pointer-typed value with the original pointer value. Accordingly, the [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) constructor that took an `address` keyword argument has been removed.

  It is still possible to make this conversion in certain cases where it is absolutely necessary, such as interoperating with other languages like Python. In this case, the compiler makes two assumptions: any pointer derived from a non-pointer-typed value does not alias any Mojo-derived pointer, and any external function calls have arbitrary memory effects.
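  To make the rule concrete, here's a minimal sketch (the function name is hypothetical, and the commented-out line shows the pattern that the removed constructor used to allow):

  ```mojo
  from memory import UnsafePointer

  fn example():
      var p = UnsafePointer[Int].alloc(1)
      p.init_pointee_move(7)
      # Converting an integer address derived from `p` back into a pointer
      # is now forbidden for Mojo-allocated memory; the `address=` keyword
      # constructor that enabled this has been removed:
      #
      # var q = UnsafePointer[Int](address=int(p))  # no longer compiles
      p.free()
  ```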
* `await` on a coroutine now consumes it. This strengthens the invariant that coroutines can be awaited only once.

### Standard library changes

* [`builtin`](/mojo/stdlib/builtin/) package:

  * The set of automatically imported entities (types, aliases, functions) into users' Mojo programs has been dramatically reduced. Before, with the way the `builtin` module was handled, all of the entities in the following modules would be automatically included: `memory`, `sys`, `os`, `utils`, `python`, `bit`, `random`, `math`, `builtin`, `collections`.

    Now, only the entities explicitly enumerated in `prelude/__init__.mojo` are automatically imported into users' Mojo programs. This will break a lot of user code, as users will need to explicitly import entities that were previously included automatically (such as [`Optional`](/mojo/stdlib/collections/optional/Optional), [`Variant`](/mojo/stdlib/utils/variant/Variant), and functions such as [`abort()`](/mojo/stdlib/os/os/abort), [`alignof()`](/mojo/stdlib/sys/info/alignof), [`bitcast()`](/mojo/stdlib/memory/unsafe/bitcast), [`bitwidthof()`](/mojo/stdlib/sys/info/bitwidthof), [`external_call()`](/mojo/stdlib/sys/ffi/external_call), [`simdwidthof()`](/mojo/stdlib/sys/info/simdwidthof), and [`sizeof()`](/mojo/stdlib/sys/info/sizeof)).

  * Some types from the `builtin` module have been moved to different modules for clarity, which is made possible now that we have a `prelude` module that can re-export symbols from modules other than `builtin`. In particular, the `builtin.string` module has been moved to [`collections.string`](/mojo/stdlib/collections/string/).

* Input and output:

  * Added the builtin [`input()`](/mojo/stdlib/builtin/io/input) function, which behaves the same as Python. ([PR #3392](https://github.com/modular/modular/pull/3392))

    ```mojo
    name = input("Enter your name: ")
    print("Hello, " + name + "!")
    ```

    If the user enters "Mojo", it prints "Hello, Mojo!". There is a known issue when running the `input()` function with JIT compilation (see issue [#3479](https://github.com/modular/modular/issues/3479)).

  * [`print()`](/mojo/stdlib/builtin/io/print) now requires that its arguments conform to the [`Formattable`](/mojo/stdlib/utils/write/Writable) trait. This enables efficient stream-based writing by default, avoiding unnecessary intermediate String heap allocations.

    Previously, `print()` required types to conform to [`Stringable`](/mojo/stdlib/builtin/str/Stringable). This meant that to execute a call like `print(a, b, c)`, at least three separate String heap allocations were performed, to hold the formatted values of `a`, `b`, and `c` respectively. The total number of allocations could be much higher if, for example, `a.__str__()` was implemented to concatenate together the fields of `a`, like in the following example:

    ```mojo
    struct Point(Stringable):
        var x: Float64
        var y: Float64

        fn __str__(self) -> String:
            # Performs 3 allocations: 1 each for str(..) of each of the fields,
            # and then the final returned `String` allocation.
return "(" + str(self.x) + ", " + str(self.y) + ")" ``` A type like the one above can transition to additionally implementing `Formattable` with the following changes: ```mojo struct Point(Stringable, Formattable): var x: Float64 var y: Float64 fn __str__(self) -> String: return String.format_sequence(self) fn format_to(self, inout writer: Formatter): writer.write("(", self.x, ", ", self.y, ")") ``` In the example above, [`String.format_sequence()`](/mojo/stdlib/collections/string/string/String#format_sequence) is used to construct a `String` from a type that implements `Formattable`. This pattern of implementing a type's `Stringable` implementation in terms of its `Formattable` implementation minimizes boilerplate and duplicated code, while retaining backwards compatibility with the requirements of the commonly used `str()` function. :::note The error shown when passing a type that does not implement `Formattable` to `print()` is currently not entirely descriptive of the underlying cause: ```shell error: invalid call to 'print': callee with non-empty variadic pack argument expects 0 positional operands, but 1 was specified print(point) ~~~~~^~~~~~~ ``` If you see the above error, ensure that all argument types implement `Formattable`. ::: * [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert) now also requires that its `message` argument conform to `Formattable`. * Added [`TemporaryDirectory`](/mojo/stdlib/tempfile/tempfile/TemporaryDirectory) in module `tempfile`. ([PR 2743](https://github.com/modular/modular/pull/2743)) * Added [`NamedTemporaryFile`](/mojo/stdlib/tempfile/tempfile/NamedTemporaryFile) in module `tempfile`. ([PR 2762](https://github.com/modular/modular/pull/2762)) * [`String`](/mojo/stdlib/collections/string/string) and friends: * The `builtin.string` module has been moved to [`collections.string`](/mojo/stdlib/collections/string/). * Added the [`String.format()`](/mojo/stdlib/collections/string/string/String#format) method. ([PR #2771](https://github.com/modular/modular/pull/2771)) Supports automatic and manual indexing of `*args`. Examples: ```mojo print( String("{1} Welcome to {0} {1}").format("mojo", "🔥") ) # 🔥 Wecome to mojo 🔥 ``` ```mojo print(String("{} {} {}").format(True, 1.125, 2)) #True 1.125 2 ``` * [`String.format()`](/mojo/stdlib/collections/string/string/String#format) now supports conversion flags `!s` and `!r`, allowing for `str()` and `repr()` conversions within format strings. ([PR \#3279](https://github.com/modular/modular/pull/3279)) Example: ```mojo String("{} {!r}").format("Mojo", "Mojo") # "Mojo 'Mojo'" String("{0!s} {0!r}").format("Mojo") # "Mojo 'Mojo'" ``` * The `String` class now has [`rjust()`](/mojo/stdlib/collections/string/string/String#rjust), [`ljust()`](/mojo/stdlib/collections/string/string/String#ljust), and [`center()`](/mojo/stdlib/collections/string/string/String#center) methods to return a justified string based on width and fillchar. ([PR \#3278](https://github.com/modular/modular/pull/3278)) * The [`atol()`](/mojo/stdlib/collections/string/string/atol) function now correctly supports leading underscores, (e.g.`atol("0x_ff", 0)`), when the appropriate base is specified or inferred (base 0). non-base-10 integer literals as per Python's [Integer Literals](https://docs.python.org/3/reference/lexical_analysis.html#integers). 
    ([PR #3180](https://github.com/modular/modular/pull/3180))

  * Added the [`unsafe_cstr_ptr()`](/mojo/stdlib/collections/string/string/String#unsafe_cstr_ptr) method to `String` and `StringLiteral`, which returns an `UnsafePointer[c_char]` for convenient interoperability with C APIs.

  * Added the `byte_length()` method to [`String`](/mojo/stdlib/collections/string/string/String#byte_length), [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice#byte_length), and [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral#byte_length), and deprecated their private `_byte_length()` methods. Added a warning to the [`String.__len__()`](/mojo/stdlib/collections/string/string/String#__len__) method that it will return the length in Unicode codepoints in the future, and [`StringSlice.__len__()`](/mojo/stdlib/collections/string/string_slice/StringSlice#__len__) now does return the length in Unicode codepoints. ([PR #2960](https://github.com/modular/modular/pull/2960))

  * Added a new [`StaticString`](/mojo/stdlib/collections/string/string_slice/#aliases) type alias. This can be used in place of [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral) for runtime string arguments.

  * Added a [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice#__init__) initializer that accepts a `StringLiteral`.

  * The `StringRef` constructors from `DTypePointer.int8` have been changed to take an `UnsafePointer[c_char]`, reflecting their use for compatibility with C APIs.

  * Continued the transition to `UnsafePointer` and unsigned byte type for strings:

    * [`String.unsafe_ptr()`](/mojo/stdlib/collections/string/string/String#unsafe_ptr) now returns an `UnsafePointer[UInt8]` (was `UnsafePointer[Int8]`)
    * [`StringLiteral.unsafe_ptr()`](/mojo/stdlib/builtin/string_literal/StringLiteral#unsafe_ptr) now returns an `UnsafePointer[UInt8]` (was `UnsafePointer[Int8]`)

* [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) and other reference type changes:

  * `DTypePointer`, `LegacyPointer`, and `Pointer` have been removed. Use [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) instead. For more information on using pointers, see [Unsafe pointers](/mojo/manual/pointers/unsafe-pointers) in the Mojo Manual.

    Functions that previously took a `DTypePointer` now take an equivalent `UnsafePointer`. A quick rule for conversion from `DTypePointer` to `UnsafePointer` is:

    ```mojo
    DTypePointer[type] -> UnsafePointer[Scalar[type]]
    ```

    There could be places where you have code of the form:

    ```mojo
    fn f(ptr: DTypePointer):
    ```

    which is equivalent to `DTypePointer[*_]`. In this case you would have to add an infer-only `type` parameter to the function:

    ```mojo
    fn f[type: DType, //](ptr: UnsafePointer[Scalar[type]]):
    ```

    because we can’t have an unbound parameter inside the struct.

    There could also be places where you use `DTypePointer[Scalar[DType.invalid/index]]`, and it would be natural to change these to `UnsafePointer[NoneType/Int]`. But since these are not an `UnsafePointer` that stores a `Scalar`, you might have to `rebind/bitcast` to appropriate types.

  * The `DTypePointer` [`load()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#load) and [`store()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#store) methods have been moved to `UnsafePointer`.
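    For illustration, here's a minimal sketch of SIMD-width loads and stores through `UnsafePointer` (assuming the 24.5-era API; the values and widths are arbitrary):

    ```mojo
    from memory import UnsafePointer

    fn main():
        var p = UnsafePointer[Float32].alloc(8)
        for i in range(8):
            p[i] = Float32(i)

        # `load()` and `store()` now live on UnsafePointer
        # (formerly on DTypePointer):
        var v = p.load[width=4](0)  # [0.0, 1.0, 2.0, 3.0]
        p.store(4, v * 2)           # writes [0.0, 2.0, 4.0, 6.0] at offset 4

        print(p.load[width=4](4))
        p.free()
    ```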
  * `UnsafePointer` now supports [`strided_load()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#strided_load), [`strided_store()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#strided_store), [`gather()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#gather), and [`scatter()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#scatter) when the underlying type is `Scalar[DType]`.

  * The global functions for working with `UnsafePointer` have transitioned to being methods through the use of conditional conformances:

    * `destroy_pointee(p)` => [`p.destroy_pointee()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#destroy_pointee)
    * `move_from_pointee(p)` => [`p.take_pointee()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#take_pointee)
    * `initialize_pointee_move(p, value)` => [`p.init_pointee_move(value)`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#init_pointee_move)
    * `initialize_pointee_copy(p, value)` => [`p.init_pointee_copy(value)`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#init_pointee_copy)
    * `move_pointee(src=p1, dst=p2)` => [`p.move_pointee_into(p2)`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#move_pointee_into)

  * The `UnsafePointer.offset()` method is deprecated and will be removed in a future release. Use [pointer arithmetic](/mojo/manual/pointers#storing-multiple-values) instead.

    ```mojo
    new_ptr = ptr.offset(1)
    ```

    Becomes:

    ```mojo
    new_ptr = ptr + 1
    ```

  * `UnsafePointer` now has an [`alignment`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#parameters) parameter to specify the static alignment of the pointer. Consequently, [`UnsafePointer.alloc()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#alloc) no longer takes in an alignment parameter, and the alignment should be specified in the type.

    ```mojo
    UnsafePointer[type].alloc[alignment](x)  # now becomes
    UnsafePointer[type, alignment].alloc(x)
    ```

  * `UnsafePointer` has a new [`exclusive: Bool = False`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#parameters) parameter. Setting this parameter to true tells the compiler that the user knows this pointer and all those derived from it have exclusive access to the underlying memory allocation. The compiler is not guaranteed to do anything with this information.

  * It is no longer possible to cast (implicitly or explicitly) from `Reference` to `UnsafePointer`. Instead of `UnsafePointer(someRef)`, please use [`UnsafePointer.address_of(someRef[])`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#address_of), which makes it explicit that the `UnsafePointer` gets the address of what the reference points to.

* Python interoperability changes:

  * Mojo now supports Python 3.12 interoperability.

  * Creating a nested [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) from a list or tuple of Python objects is now possible:

    ```mojo
    var np = Python.import_module("numpy")
    var a = np.array([1, 2, 3])
    var b = np.array([4, 5, 6])
    var arrays = PythonObject([a, b])
    assert_equal(len(arrays), 2)
    ```

    This also allows more convenient call syntax:

    ```mojo
    var stacked = np.hstack((a, b))
    assert_equal(str(stacked), "[1 2 3 4 5 6]")
    ```

    ([PR #3264](https://github.com/modular/modular/pull/3264))

  * Accessing local Python modules with [`Python.add_to_path(".")`](/mojo/stdlib/python/python/Python#add_to_path) is no longer required. It now behaves the same as Python.
* Collections:

  * [`List`](/mojo/stdlib/collections/list/List) values are now equality comparable with `==` and `!=` when their element type is equality comparable. ([PR #3195](https://github.com/modular/modular/pull/3195))

  * [`Optional`](/mojo/stdlib/collections/optional/Optional) values are now equality comparable with `==` and `!=` when their element type is equality comparable.

  * Added a new [`Counter`](/mojo/stdlib/collections/counter/Counter) dictionary-like type, matching most of the features of the Python one. ([PR #2910](https://github.com/modular/modular/pull/2910))

  * [`Dict`](/mojo/stdlib/collections/dict/Dict) now implements [`setdefault()`](/mojo/stdlib/collections/dict/Dict#setdefault), which gets a value from the dictionary by key, or sets it to a default if it doesn't exist. ([PR #2803](https://github.com/modular/modular/pull/2803))

  * `Dict` now supports [`popitem()`](/mojo/stdlib/collections/dict/Dict#popitem), which removes and returns the last item in the `Dict`. ([PR #2701](https://github.com/modular/modular/pull/2701))

  * Added a [`Dict.__init__()`](/mojo/stdlib/collections/dict/Dict#__init__) overload to specify initial capacity. ([PR #3171](https://github.com/modular/modular/pull/3171)) The capacity has to be a power of two and greater than or equal to 8. It allows for faster initialization by skipping incremental growth steps. Example:

    ```mojo
    var dictionary = Dict[Int,Int](power_of_two_initial_capacity = 1024)
    # Insert (2/3 of 1024) entries
    ```

  * `ListLiteral` now supports [`__contains__()`](/mojo/stdlib/builtin/list_literal/ListLiteral#__contains__). ([PR #3251](https://github.com/modular/modular/pull/3251))

* Filesystem and environment utilities:

  * [`Path.home()`](/mojo/stdlib/pathlib/path/Path#home) has been added to return a path of the user's home directory.

  * [`os.path.expanduser()`](/mojo/stdlib/os/path/path/expanduser) and [`pathlib.Path.expanduser()`](/mojo/stdlib/pathlib/path/Path#expanduser) have been added to allow expanding a prefixed `~` in a `String` or `Path` with the user's home path:

    ```mojo
    import os
    print(os.path.expanduser("~/.modular"))
    # /Users/username/.modular
    print(os.path.expanduser("~root/folder"))
    # /var/root/folder (on macOS)
    # /root/folder (on Linux)
    ```

  * [`os.path.split()`](/mojo/stdlib/os/path/path/split) has been added for splitting a path into `head, tail`:

    ```mojo
    import os
    head, tail = os.path.split("/this/is/head/tail")
    print("head:", head)
    print("tail:", tail)
    # head: /this/is/head
    # tail: tail
    ```

  * [`os.makedirs()`](/mojo/stdlib/os/os/makedirs) and [`os.removedirs()`](/mojo/stdlib/os/os/removedirs) have been added for creating and removing nested directories:

    ```mojo
    import os
    path = os.path.join("dir1", "dir2", "dir3")
    os.makedirs(path, exist_ok=True)
    os.removedirs(path)
    ```

  * The [`pwd`](/mojo/stdlib/pwd/pwd/) module has been added for accessing user information in `/etc/passwd` on POSIX systems.
    This follows the same logic as Python:

    ```mojo
    import pwd
    import os
    current_user = pwd.getpwuid(os.getuid())
    print(current_user)
    # pwd.struct_passwd(pw_name='jack', pw_passwd='********', pw_uid=501,
    # pw_gid=20, pw_gecos='Jack Clayton', pw_dir='/Users/jack',
    # pw_shell='/bin/zsh')

    print(current_user.pw_uid)
    # 501

    root = pwd.getpwnam("root")
    print(root)
    # pwd.struct_passwd(pw_name='root', pw_passwd='*', pw_uid=0, pw_gid=0,
    # pw_gecos='System Administrator', pw_dir='/var/root', pw_shell='/bin/zsh')
    ```

* Other new traits and related features:

  * Added the [`ExplicitlyCopyable`](/mojo/stdlib/builtin/value/ExplicitlyCopyable) trait to mark types that can be copied explicitly, but which might not be implicitly copyable. This supports work to transition the standard library collection types away from implicit copyability, which can lead to unintended expensive copies.

  * Added the [`Identifiable`](/mojo/stdlib/builtin/identifiable/Identifiable) trait, used to describe types that implement the `__is__()` and `__isnot__()` trait methods. ([PR #2807](https://github.com/modular/modular/pull/2807))

  * Types conforming to [`Boolable`](/mojo/stdlib/builtin/bool/Boolable) (that is, those implementing `__bool__()`) no longer implicitly convert to `Bool`. A new [`ImplicitlyBoolable`](/mojo/stdlib/builtin/bool/ImplicitlyBoolable) trait is introduced for types where this behavior is desired.

* Miscellaneous:

  * [`NoneType`](/mojo/stdlib/builtin/none/NoneType) is now a normal standard library type, and not an alias for a raw MLIR type. Function signatures written as `fn() -> NoneType` should transition to being written as `fn() -> None`.

  * Mojo now has a [`UInt`](/mojo/stdlib/builtin/uint/UInt) type for modeling unsigned (scalar) integers with a platform-dependent width. `UInt` implements most arithmetic operations that make sense for integers, with the notable exception of `__neg__()`. Builtin functions such as `min()`/`max()`, as well as `math` functions like `ceildiv()`, `align_down()`, and `align_up()`, are also implemented for `UInt`.

  * Now that Mojo has a `UInt` type, it is used to represent the return type of a hash. In general, hashes should be unsigned integers, which can also lead to improved performance in certain cases.

  * Added the [`c_char`](/mojo/stdlib/sys/ffi/#aliases) type alias in `sys.ffi`.

  * [`sort()`](/mojo/stdlib/builtin/sort/sort) now supports a `stable` parameter. It can be called as:

    ```mojo
    sort[cmp_fn, stable=True](list)
    ```

    The algorithm requires $O(N)$ auxiliary memory. If the extra memory allocation fails, the program crashes.

  * `sort()` no longer takes `LegacyPointer` since that type is now removed.

  * Added the [`oct()`](/mojo/stdlib/builtin/format_int/oct) builtin function for formatting an integer in octal. ([PR #2914](https://github.com/modular/modular/pull/2914))

  * Added the [`assert_is()`](/mojo/stdlib/testing/testing/assert_is) and [`assert_is_not()`](/mojo/stdlib/testing/testing/assert_is_not) test functions to the `testing` module.

  * The [`math`](/mojo/stdlib/math/constants/) package now includes the `pi`, `e`, and `tau` constants (closes issue [#2135](https://github.com/modular/modular/issues/2135)).

  * The [`ulp`](/mojo/stdlib/math/math/ulp) function from `numerics` has been moved to the `math` module.

  * The `bit` module now supports [`bit_reverse()`](/mojo/stdlib/bit/bit/bit_reverse), [`byte_swap()`](/mojo/stdlib/bit/bit/byte_swap), and [`pop_count()`](/mojo/stdlib/bit/bit/pop_count) for the `Int` type. ([PR #3150](https://github.com/modular/modular/pull/3150))
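    For example, a quick sketch of the new `Int` overloads (a rough illustration based on the function names above):

    ```mojo
    from bit import bit_reverse, byte_swap, pop_count

    fn main():
        var x: Int = 0b1011
        print(pop_count(x))    # 3 bits set
        print(byte_swap(x))    # bytes of x reversed
        print(bit_reverse(x))  # bits of x reversed
    ```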
* A few `bit` functions have been renamed for clarity:

  * `countl_zero()` -> [`count_leading_zeros()`](/mojo/stdlib/bit/bit/count_leading_zeros)
  * `countr_zero()` -> [`count_trailing_zeros()`](/mojo/stdlib/bit/bit/count_trailing_zeros)

* [`Slice`](/mojo/stdlib/builtin/builtin_slice/Slice) now uses `OptionalReg[Int]` for `start` and `end` and implements a constructor which accepts optional values. `Slice._has_end()` has also been removed since a `Slice` with no end is now represented by an empty `Slice.end` option. ([PR #2495](https://github.com/modular/modular/pull/2495))

  ```mojo
  var s = Slice(1, None, 2)
  print(s.start.value())  # must retrieve the value from the optional
  ```

* The `rank` argument for [`algorithm.elementwise()`](/mojo/stdlib/algorithm/functional/elementwise) is no longer required and is only inferred.

* The `time.now()` function has been deprecated. Please use [`time.perf_counter()`](/mojo/stdlib/time/time/perf_counter) or [`time.perf_counter_ns`](/mojo/stdlib/time/time/perf_counter_ns) instead.

* [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) construction from `Bool` has been restricted to the `DType.bool` data type.

### Tooling changes

* [`mojo test`](/mojo/cli/test) new features and changes:

  * `mojo test` now uses the Mojo compiler for running unit tests. This resolves compilation issues that sometimes appeared, and also improves overall test times, since unit tests are compiled only once before all of them are executed. These changes do not apply to doctests, due to their different semantics.

  * The `mojo test` command now accepts a `--filter` option that narrows the set of tests collected and executed. The filter string is a POSIX extended regular expression.

  * The `mojo test` command now supports using the same compilation options as `mojo build`.

  * You can now debug unit tests using `mojo test` by passing the `--debug` flag. Most debug flags are supported; run `mojo test --help` for a full listing. Debugging doctests is not currently supported.

* Mojo debugger new features and changes:

  * The `mojo debug --rpc` command has been renamed to [`mojo debug --vscode`](/mojo/cli/debug#debug-server-options), which is now able to manage multiple VS Code windows.

  * The Mojo debugger now supports a `break-on-raise` command that instructs the debugger to stop at any `raise` statement. A similar feature has been added to the debugger in VS Code.

  * The Mojo debugger now hides the artificial function arguments `__result__` and `__error__` created by the compiler for Mojo code.

* VS Code support changes:

  * The VS Code extension now supports a vendored MAX SDK for VS Code, which is automatically downloaded by the extension and used for all Mojo features, including the Mojo Language Server, the Mojo debugger, the Mojo formatter, and more.

  * A proxy has been added to the Mojo Language Server on VS Code that handles crashes more gracefully.

  * The Mojo Language Server no longer sets `.` as a commit character for auto-completion.

### ❌ Removed

* Support for the legacy `fn __init__(...) -> Self:` form has been removed from the compiler; please switch to using `fn __init__(inout self, ...):` instead.

* The builtin `tensor` module has been removed. Identical functionality is available in [`max.tensor`](/max/api/mojo/tensor/tensor), but it is generally recommended to use structs from the [`buffer`](/mojo/stdlib/buffer/buffer) module when possible instead.

* Removed `String.unsafe_uint8_ptr()`.
  `String.unsafe_ptr()` now returns the same thing.

* Removed `StringLiteral.unsafe_uint8_ptr()` and `StringLiteral.as_uint8_ptr()`.

* Removed `SIMD.splat(value: Scalar[type])`. Use the constructor for `SIMD` instead.

* Removed the `SIMD.{add,mul,sub}_with_overflow()` methods.

* Removed the `SIMD.min()` and `SIMD.max()` methods. Identical functionality is available using the builtin [`min()`](/mojo/stdlib/builtin/math/min) and [`max()`](/mojo/stdlib/builtin/math/max) functions.

* Removed the Mojo Language Server warnings for unused function arguments.

* The `Run Mojo File in Dedicated Terminal` action has been removed; the `Run Mojo File` action now always opens a dedicated terminal for each Mojo file to guarantee a correct environment.

### 🛠️ Fixed

* Fixed a crash in the Mojo Language Server when importing the current file.

* Fixed a crash when specifying variadic keyword arguments without a type expression in `def` functions, e.g.:

  ```mojo
  def foo(**kwargs): ...  # now works
  ```

* Mojo now prints `ref` arguments and results in generated documentation correctly.

* [#1734](https://github.com/modular/modular/issues/1734) - Calling `__copyinit__` on self causes crash.
* [#3142](https://github.com/modular/modular/issues/3142) - \[QoI] Confusing `__setitem__` method is failing with a "must be mutable" error.
* [#248](https://github.com/modular/modular/issues/248) - \[Feature] Enable `__setitem__` to take variadic arguments.
* [#3065](https://github.com/modular/modular/issues/3065) - Fix incorrect behavior of `SIMD.__int__` on unsigned types.
* [#3045](https://github.com/modular/modular/issues/3045) - Disable implicit SIMD conversion routes through `Bool`.
* [#3126](https://github.com/modular/modular/issues/3126) - \[BUG] List doesn't work at compile time.
* [#3237](https://github.com/modular/modular/issues/3237) - \[BUG] Difference between `__getitem__` and `[.]` operator.
* [#3336](https://github.com/modular/modular/issues/3336) - Fix outdated references to `let` in REPL documentation.

* The VS Code extension no longer caches the information of the selected MAX SDK, which was causing issues upon changes in the SDK.

* The Mojo debugger now stops showing spurious warnings when parsing closures.

### Special thanks

Special thanks to our community contributors: [@jjvraw](https://github.com/jjvraw), [@artemiogr97](https://github.com/artemiogr97), [@martinvuyk](https://github.com/martinvuyk), [@jayzhan211](https://github.com/jayzhan211), [@bgreni](https://github.com/bgreni), [@mzaks](https://github.com/mzaks), [@msaelices](https://github.com/msaelices), [@rd4com](https://github.com/rd4com), [@jiex-liu](https://github.com/jiex-liu), [@kszucs](https://github.com/kszucs), [@thatstoasty](https://github.com/thatstoasty)

## v24.4 (2024-06-07)

### ✨ Highlights

Big themes for this release:

* Improvements to the performance and ease of use of `def` functions.
* Continued unification of standard library APIs around the `UnsafePointer` type.
* Many quality-of-life improvements for the standard library collection types.
* Significant performance improvements when inserting into a `Dict`. Performance on this metric is still not where we'd like it to be, but it is much improved.
* A new `@parameter for` mechanism for expressing compile-time loops, which replaces the earlier (and less reliable) `@unroll` decorator.
* New Mojo Manual pages on [Control flow](/mojo/manual/control-flow), [Testing](/mojo/tools/testing), and using [unsafe pointers](/mojo/manual/pointers/unsafe-pointers).
### Language changes

* Mojo has changed how `def` function arguments are processed. Previously, by default, arguments to a `def` were treated according to the `owned` convention, which makes a copy of the value, enabling that value to be mutable in the callee. This could lead to major performance issues because of the proliferation of unnecessary copies. It also required you to declare non-copyable types as `borrowed` explicitly. Now Mojo takes a different approach: `def` functions take arguments as `borrowed` by default (consistent with `fn` functions) but will make a local copy of the value **only if the argument is mutated** in the body of the function. This improves consistency, performance, and ease of use.

* Implicit variable definitions in a `def` function are more flexible: you can now implicitly declare variables as the result of a tuple return, using `a, b, c = foo()`. For example:

  ```mojo
  def return_two(i: Int) -> (Int, Int):
      return i, i+1

  a, b = return_two(5)
  ```

  Implicit variable declarations can also now shadow global immutable symbols (such as module names and built-ins) without getting a compiler error. For example:

  ```mojo
  slice = foo()
  ```

* Mojo functions can return an auto-dereferenced reference to storage with a new `ref` keyword in the result type specifier. For example:

  ```mojo
  @value
  struct Pair:
      var first: Int
      var second: Int

      fn get_first_ref(inout self) -> ref [self] Int:
          return self.first

  fn show_mutation():
      var somePair = Pair(5, 6)
      somePair.get_first_ref() = 1
  ```

  This approach provides a general way to return an "automatically dereferenced" reference of a given type. Notably, this eliminates the need for `__refitem__()` to exist. `__refitem__()` has thus been removed and replaced with a `__getitem__()` that returns a reference.

* Mojo has added support for *infer-only parameters*. Infer-only parameters must appear at the beginning of the parameter list and cannot be explicitly specified by the user. They are declared to the left of a `//` marker, much like positional-only parameters. This allows programmers to define functions with dependent parameters to be called without the caller specifying all the necessary parameters. For example:

  ```mojo
  fn parameter_simd[dt: DType, //, value: Scalar[dt]]():
      print(value)

  fn call_it():
      parameter_simd[Int32(42)]()
  ```

  In the above example, `Int32(42)` is passed directly into `value`, the first parameter that isn't infer-only. `dt` is inferred from the parameter itself to be `DType.int32`.

  This also works with structs. For example:

  ```mojo
  struct ScalarContainer[dt: DType, //, value: Scalar[dt]]:
      pass

  fn foo(x: ScalarContainer[Int32(0)]):  # 'dt' is inferred as `DType.int32`
      pass
  ```

  This should make working with dependent parameters more ergonomic. See [Infer-only parameters](/mojo/manual/parameters/#infer-only-parameters) in the Mojo Manual.

* Mojo now allows functions overloaded on parameters to be resolved when forming references to, but not calling, those functions. For example, the following now works:

  ```mojo
  fn overloaded_parameters[value: Int32]():
      pass

  fn overloaded_parameters[value: Float32]():
      pass

  fn form_reference():
      alias ref = overloaded_parameters[Int32()]  # works!
  ```

* Mojo now supports adding a `@deprecated` decorator on structs, functions, traits, aliases, and global variables. The decorator marks the attached declaration as deprecated and causes a warning to be emitted when the deprecated declaration is referenced in user code. The decorator requires a deprecation message, specified as a string literal.
```mojo
@deprecated("Foo is deprecated, use Bar instead")
struct Foo:
    pass

fn outdated_api(x: Foo):  # warning: Foo is deprecated, use Bar instead
    pass

@deprecated("use another function!")
fn bar():
    pass

fn techdebt():
    bar()  # warning: use another function!
```

* Mojo has introduced [`@parameter for`](/mojo/manual/decorators/parameter#parametric-for-statement), a new feature for compile-time programming. `@parameter for` defines a for loop where the sequence and the induction values in the sequence must be parameter values. For example:

  ```mojo
  fn parameter_for[max: Int]():
      @parameter
      for i in range(max):
          @parameter
          if i == 10:
              print("found 10!")
  ```

  Currently, `@parameter for` requires the sequence's `__iter__()` method to return a `_StridedRangeIterator`, meaning the induction variables must be `Int`. The intention is to lift these restrictions in the future.

* The `is_mutable` parameter of `Reference` and `AnyLifetime` is now a `Bool`, not a low-level `__mlir_type.i1` value. This improves the ergonomics of spelling out a `Reference` type explicitly.

* Mojo will now link to a Python dynamic library based on the first Python found in your `PATH`. This enables you to activate a virtual environment like `conda` and have access to Python modules installed in that environment without setting `MOJO_PYTHON_LIBRARY`. Previously Mojo would find a `libpython` dynamic library on installation and put the path in `.modular/modular.cfg`, which could result in version conflicts if you activated a virtual environment of a different Python version.

* `AnyRegType` has been renamed to `AnyTrivialRegType` and Mojo now forbids binding non-trivial register-passable types to `AnyTrivialRegType`. This closes a major safety hole in the language. Please use `AnyType` for generic code going forward.

* The `let` keyword has been completely removed from the language. We previously removed `let` declarations but still provided an error message to users. Now, it is completely gone from the grammar.

### Standard library changes

* New traits and related features:

  * Added built-in [`repr()`](/mojo/stdlib/builtin/repr/repr) function and [`Representable`](/mojo/stdlib/builtin/repr/Representable) trait. ([PR #2361](https://github.com/modular/modular/pull/2361))

  * Added the [`Indexer`](/mojo/stdlib/builtin/int/Indexer) trait to denote types that implement the `__index__()` method, which allows these types to be accepted in common `__getitem__()` and `__setitem__()` implementations, as well as allowing a new built-in [`index()`](/mojo/stdlib/builtin/int/index-function) function to be called on them. Most standard library containers can now be indexed by any type that implements `Indexer`. For example:

    ```mojo
    @value
    struct AlwaysZero(Indexer):
        fn __index__(self) -> Int:
            return 0

    struct MyList:
        var data: List[Int]

        fn __init__(inout self):
            self.data = List[Int](1, 2, 3, 4)

        fn __getitem__[T: Indexer](self, idx: T) -> Int:
            return self.data[index(idx)]

    print(MyList()[AlwaysZero()])  # prints `1`
    ```

    Types conforming to the `Indexer` trait are implicitly convertible to `Int`. This means you can write generic APIs that take `Int` instead of making them take a generic type that conforms to `Indexer`.
    For example:

    ```mojo
    @value
    struct AlwaysZero(Indexer):
        fn __index__(self) -> Int:
            return 0

    @value
    struct Incrementer:
        fn __getitem__(self, idx: Int) -> Int:
            return idx + 1

    var a = Incrementer()
    print(a[AlwaysZero()])  # works and prints 1
    ```

    ([PR #2685](https://github.com/modular/modular/pull/2685))

* Added traits allowing user-defined types to be supported by various built-in and math functions.

  | Function                                         | Trait                                              | Required method |
  | ------------------------------------------------ | -------------------------------------------------- | --------------- |
  | [`abs()`](/mojo/stdlib/builtin/math/abs)         | [`Absable`](/mojo/stdlib/builtin/math/Absable)     | `__abs__()`     |
  | [`pow()`](/mojo/stdlib/builtin/math/pow)         | [`Powable`](/mojo/stdlib/builtin/math/Powable)     | `__pow__()`     |
  | [`round()`](/mojo/stdlib/builtin/math/round)     | [`Roundable`](/mojo/stdlib/builtin/math/Roundable) | `__round__()`   |
  | [`math.ceil`](/mojo/stdlib/math/math/ceil)       | `math.Ceilable`                                    | `__ceil__()`    |
  | [`math.ceildiv`](/mojo/stdlib/math/math/ceildiv) | `math.CeilDivable` / `math.CeilDivableRaising`     | `__ceildiv__()` |
  | [`math.floor`](/mojo/stdlib/math/math/floor)     | `math.Floorable`                                   | `__floor__()`   |
  | [`math.trunc`](/mojo/stdlib/math/math/trunc)     | `Truncable`                                        | `__trunc__()`   |

  Notes:

  * Conforming to the `Powable` trait also means that the type can be used with the power operator (`**`).
  * For `ceildiv()`, structs can conform to either the `CeilDivable` trait or the `CeilDivableRaising` trait.
  * Due to ongoing refactoring, the traits `Ceilable`, `CeilDivable`, `Floorable`, and `Truncable` do not appear in the API reference. They should be imported from the `math` module, except for `Truncable`, which is (temporarily) available as a built-in trait and does not need to be imported.

  Example:

  ```mojo
  from math import sqrt

  @value
  struct Complex2(Absable, Roundable):
      var re: Float64
      var im: Float64

      fn __abs__(self) -> Self:
          return Self(sqrt(self.re * self.re + self.im * self.im), 0.0)

      fn __round__(self) -> Self:
          return Self(round(self.re, 0), round(self.im, 0))

      fn __round__(self, ndigits: Int) -> Self:
          return Self(round(self.re, ndigits), round(self.im, ndigits))
  ```

* Benchmarking:

  * The [`bencher`](/mojo/stdlib/benchmark/bencher/) module as part of the `benchmark` package is now public and documented. This module provides types such as `Bencher`, which provides the ability to execute a `Benchmark` and allows for benchmarking configuration via the `BenchmarkConfig` struct.

* [`String`](/mojo/stdlib/collections/string/string) and friends:

  * **Breaking.** Implicit conversion to `String` is now removed for builtin classes/types. Use [`str()`](/mojo/stdlib/builtin/str/str) explicitly to convert to `String`.

  * Added the [`String.isspace()`](/mojo/stdlib/collections/string/string/String#isspace) method, conformant with Python's universal separators. This replaces the `isspace()` free function from the `string` module. (If you need the old function, it is temporarily available as `_isspace()`. It now takes a `UInt8` but is otherwise unchanged.)

  * [`String.split()`](/mojo/stdlib/collections/string/string/String#split) now defaults to whitespace and has Pythonic behavior in that it removes all adjacent whitespace by default.

  * [`String.strip()`](/mojo/stdlib/collections/string/string/String#strip), [`lstrip()`](/mojo/stdlib/collections/string/string/String#lstrip) and [`rstrip()`](/mojo/stdlib/collections/string/string/String#rstrip) can now remove custom characters other than whitespace. In addition, there are now several useful aliases for whitespace, ASCII lower/uppercase, and so on. ([PR #2555](https://github.com/modular/modular/pull/2555))
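    For example, a small sketch of stripping custom characters (assuming the strip methods accept the set of characters to remove as a `String`):

    ```mojo
    fn main():
        var s = String("***mojo***")
        print(s.strip("*"))   # mojo
        print(s.lstrip("*"))  # mojo***
        print(s.rstrip("*"))  # ***mojo
    ```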
  * `String` now has a [`splitlines()`](/mojo/stdlib/collections/string/string/String#splitlines) method, which allows splitting strings at line boundaries. This method supports [universal newlines](https://docs.python.org/3/glossary.html#term-universal-newlines) and provides an option to retain or remove the line break characters. ([PR #2810](https://github.com/modular/modular/pull/2810))

  * `InlinedString` has been renamed to [`InlineString`](/mojo/stdlib/collections/string/inline_string/InlineString) to be consistent with other types.

  * [`StringRef`](/mojo/stdlib/collections/string/string_slice/StringSlice) now implements [`strip()`](/mojo/stdlib/collections/string/string_slice/StringSlice#strip), which can be used to remove leading and trailing whitespace. ([PR #2683](https://github.com/modular/modular/pull/2683))

  * `StringRef` now implements [`startswith()`](/mojo/stdlib/collections/string/string_slice/StringSlice#startswith) and [`endswith()`](/mojo/stdlib/collections/string/string_slice/StringSlice#endswith). ([PR #2710](https://github.com/modular/modular/pull/2710))

  * Added a new [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice) type, to replace uses of the unsafe `StringRef` type in standard library code. `StringSlice` is a non-owning reference to encoded string data. Unlike `StringRef`, a `StringSlice` is safely tied to the lifetime of the data it points to.

  * Added new [`as_string_slice()`](/mojo/stdlib/collections/string/string/String#as_string_slice) methods to `String` and `StringLiteral`.

  * Added a `StringSlice` initializer from an `UnsafePointer` and a length in bytes.

  * Added a new [`as_bytes_slice()`](/mojo/stdlib/collections/string/string/String#as_bytes_slice) method to `String` and `StringLiteral`, which returns a `Span` of the bytes owned by the string.

  * Continued the transition to [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) and the unsigned byte type for strings:

    * Renamed `String._as_ptr()` to [`String.unsafe_ptr()`](/mojo/stdlib/collections/string/string/String#unsafe_ptr), and changed the return type to `UnsafePointer` (was `DTypePointer`).
    * Renamed `StringLiteral.data()` to [`StringLiteral.unsafe_ptr()`](/mojo/stdlib/builtin/string_literal/StringLiteral#unsafe_ptr), and changed the return type to `UnsafePointer` (was `DTypePointer`).
    * `InlineString.as_ptr()` has been renamed to [`unsafe_ptr()`](/mojo/stdlib/collections/string/inline_string/InlineString#unsafe_ptr) and now returns an `UnsafePointer[UInt8]` (was `DTypePointer[DType.int8]`).
    * `StringRef.data` is now an `UnsafePointer` (was `DTypePointer`) and [`StringRef.unsafe_ptr()`](/mojo/stdlib/collections/string/string_slice/StringSlice#unsafe_ptr) now returns an `UnsafePointer[UInt8]` (was `DTypePointer[DType.int8]`).

* Other built-ins:

  * The `Slice.__len__()` function has been removed and [`Slice`](/mojo/stdlib/builtin/builtin_slice/Slice) no longer conforms to the `Sized` trait. This clarifies the ambiguity of the semantics: the length of a slice always depends on the length of the object being sliced. Users who need the existing functionality can use the [`Slice.unsafe_indices()`](/mojo/stdlib/builtin/builtin_slice/Slice#indices) method. This makes it explicit that this implementation does not check if the slice bounds are concrete or within any given object's length.
  * Added a built-in [`sort()`](/mojo/stdlib/builtin/sort/sort) function for lists of elements that conform to the [`ComparableCollectionElement`](/mojo/stdlib/builtin/value/ComparableCollectionElement) trait. ([PR #2609](https://github.com/modular/modular/pull/2609))

  * [`int()`](/mojo/stdlib/builtin/int/int-function) can now take a string and a specified base to parse an integer from a string: `int("ff", 16)` returns `255`. Additionally, if a base of zero is specified, the string will be parsed as if it were an integer literal, with the base determined by whether the string contains the prefix `"0x"`, `"0o"`, or `"0b"`. ([PR #2273](https://github.com/modular/modular/pull/2273), fixes [#2274](https://github.com/modular/modular/issues/2274))

  * Added the [`bin()`](/mojo/stdlib/builtin/format_int/bin) built-in function to convert integral types into their binary string representation. ([PR #2603](https://github.com/modular/modular/pull/2603))

  * Added the [`atof()`](/mojo/stdlib/collections/string/string/atof) built-in function, which can convert a `String` to a `float64`. ([PR #2649](https://github.com/modular/modular/pull/2649))

  * You can now use the built-in [`any()`](/mojo/stdlib/builtin/bool/any) and [`all()`](/mojo/stdlib/builtin/bool/all) functions to check for truthy elements in a collection. Because `SIMD.__bool__()` is now constrained to `size=1`, you must explicitly use these to get the truthy value of a SIMD vector with more than one element. This avoids common bugs around implicit conversion of `SIMD` to `Bool`. ([PR #2600](https://github.com/modular/modular/pull/2600))

    For example:

    ```mojo
    fn truthy_simd():
        var vec = SIMD[DType.int32, 4](0, 1, 2, 3)
        if any(vec):
            print("any elements are truthy")
        if all(vec):
            print("all elements are truthy")
    ```

  * `object` now implements all the bitwise operators. ([PR #2324](https://github.com/modular/modular/pull/2324))

  * [`Tuple`](/mojo/stdlib/builtin/tuple/Tuple) now supports `__contains__()`. ([PR #2709](https://github.com/modular/modular/pull/2709)) For example:

    ```mojo
    var x = Tuple(1, 2, True)
    if 1 in x:
        print("x contains 1")
    ```

  * [`ListLiteral`](/mojo/stdlib/builtin/list_literal/ListLiteral) and `Tuple` now only require that element types be `Movable`. Consequently, `ListLiteral` and `Tuple` are themselves no longer `Copyable`.

  * Added new `ImmutableStaticLifetime` and `MutableStaticLifetime` helpers.

* [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) and others:

  * Added a new [`memcpy()`](/mojo/stdlib/memory/memory/memcpy) overload for `UnsafePointer[Scalar[_]]` pointers.

  * Removed the `get_null()` method from `UnsafePointer` and other pointer types. Please use the default constructor instead: `UnsafePointer[T]()`.

  * Many functions returning a pointer type have been unified to have a public API function of `unsafe_ptr()`.

  * The `Tensor.data()` method has been renamed to `unsafe_ptr()`. The return type is still a `DTypePointer[T]`.

* Collections:

  * [`List`](/mojo/stdlib/collections/list/List) now has an [`index()`](/mojo/stdlib/collections/list/List#index) method that allows you to find the (first) location of an element in a `List` of `EqualityComparable` types. For example:

    ```mojo
    var my_list = List[Int](2, 3, 5, 7, 3)
    print(my_list.index(3))  # prints 1
    ```

  * `List` can now be converted to a `String` with a simplified syntax:

    ```mojo
    var my_list = List[Int](2, 3)
    print(my_list.__str__())  # prints [2, 3]
    ```

    Note that `List` doesn't conform to the `Stringable` trait yet, so you cannot use `str(my_list)`.
    ([PR #2673](https://github.com/modular/modular/pull/2673))

  * `List` has a simplified syntax to call the [`count()`](/mojo/stdlib/collections/list/List#count) method: `my_list.count(x)`. ([PR #2675](https://github.com/modular/modular/pull/2675))

  * `List` now supports `__contains__()`, so you can now use lists with the `in` operator:

    ```mojo
    if x in my_list:
    ```

    ([PR #2667](https://github.com/modular/modular/pull/2667))

  * `List` now has an [`unsafe_get()`](/mojo/stdlib/collections/list/List#unsafe_get) method to get a reference to an element without bounds checking or wraparound for negative indices. Note that this method is unsafe. Use with caution. ([PR #2800](https://github.com/modular/modular/pull/2800))

  * Added a [`fromkeys()`](/mojo/stdlib/collections/dict/Dict#fromkeys) method to `Dict` to return a `Dict` with the specified keys and values. ([PR #2622](https://github.com/modular/modular/pull/2622))

  * Added a [`clear()`](/mojo/stdlib/collections/dict/Dict#clear) method to `Dict`. ([PR #2627](https://github.com/modular/modular/pull/2627))

  * `Dict` now supports [`reversed()`](/mojo/stdlib/builtin/reversed/reversed) for its `items()` and `values()` iterators. ([PR #2340](https://github.com/modular/modular/pull/2340))

  * `Dict` now has a simplified conversion to `String` with `my_dict.__str__()`. Note that `Dict` does not conform to the `Stringable` trait, so `str(my_dict)` is not possible yet. ([PR #2674](https://github.com/modular/modular/pull/2674))

  * `Dict` now implements [`get(key)`](/mojo/stdlib/collections/dict/Dict#get) and `get(key, default)` functions. ([PR #2519](https://github.com/modular/modular/pull/2519))

  * Added a temporary `__get_ref(key)` method to `Dict`, allowing you to get a `Reference` to a dictionary value.

  * Added a new [`InlineList`](/mojo/stdlib/collections/inline_array/InlineArray) type, a stack-allocated list with a static maximum size. ([PR #2587](https://github.com/modular/modular/pull/2587)) ([PR #2703](https://github.com/modular/modular/pull/2703))

  * Added a new [`Span`](/mojo/stdlib/memory/span/Span) type for taking slices of contiguous collections. ([PR #2595](https://github.com/modular/modular/pull/2595))

* [`os`](/mojo/stdlib/os/os/) module:

  * The `os` module now provides functionality for adding and removing directories using [`mkdir()`](/mojo/stdlib/os/os/mkdir) and [`rmdir()`](/mojo/stdlib/os/os/rmdir). ([PR #2430](https://github.com/modular/modular/pull/2430))

  * Added the [`os.path.getsize()`](/mojo/stdlib/os/path/path/getsize) function, which gives the size in bytes of the file identified by the path. ([PR #2626](https://github.com/modular/modular/pull/2626))

  * Added the [`os.path.join()`](/mojo/stdlib/os/path/path/join) function. ([PR #2792](https://github.com/modular/modular/pull/2792))

  * Added a new [`tempfile`](/mojo/stdlib/tempfile/tempfile/) module, with `gettempdir()` and `mkdtemp()` functions. ([PR #2742](https://github.com/modular/modular/pull/2742))

* [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type:

  * Added [`SIMD.shuffle()`](/mojo/stdlib/builtin/simd/SIMD#shuffle) with `IndexList` mask. ([PR #2315](https://github.com/modular/modular/pull/2315))

  * [`SIMD.__bool__()`](/mojo/stdlib/builtin/simd/SIMD#__bool__) is constrained such that it only works when `size` is `1`. For SIMD vectors with more than one element, use [`any()`](/mojo/stdlib/builtin/bool/any) or [`all()`](/mojo/stdlib/builtin/bool/all).
    ([PR #2502](https://github.com/modular/modular/pull/2502))

  * The [`SIMD.reduce_or()`](/mojo/stdlib/builtin/simd/SIMD#reduce_or) and [`SIMD.reduce_and()`](/mojo/stdlib/builtin/simd/SIMD#reduce_and) methods are now bitwise operations, and support integer types. ([PR #2671](https://github.com/modular/modular/pull/2671))

  * Added [`SIMD.__repr__()`](/mojo/stdlib/builtin/simd/SIMD#__repr__) to get the verbose string representation of `SIMD` types. ([PR #2728](https://github.com/modular/modular/pull/2728))

* [`math`](/mojo/stdlib/math/math/) package:

  * The `math.bit` module has been moved to a new top-level [`bit`](/mojo/stdlib/bit/bit/) module. The following functions in this module have been renamed:

    * `ctlz` -> `countl_zero`
    * `cttz` -> `countr_zero`
    * `bit_length` -> `bit_width`
    * `ctpop` -> `pop_count`
    * `bswap` -> `byte_swap`
    * `bitreverse` -> `bit_reverse`

  * The `math.rotate_bits_left()` and `math.rotate_bits_right()` functions have been moved to the `bit` module.

  * The `is_power_of_2()` function in the `math` module is now called `is_power_of_two()` and located in the `bit` module.

  * The `abs()`, `round()`, `min()`, `max()`, `pow()`, and `divmod()` functions have moved from `math` to `builtin`, so you no longer need to import these functions.

  * The `math.tgamma()` function has been renamed to [`math.gamma()`](/mojo/stdlib/math/math/gamma) to conform with Python's naming.

  * The implementations of the following functions have been moved from the `math` module to the new [`utils.numerics`](/mojo/stdlib/utils/numerics/) module: `isfinite()`, `isinf()`, `isnan()`, `nan()`, `nextafter()`, and `ulp()`. The functions continue to be exposed in the `math` module.

  * [`math.gcd()`](/mojo/stdlib/math/math/gcd) now works on negative inputs, and, like Python's implementation, accepts a variadic list of integers. New overloads for a `List` or `Span` of integers are also added. ([PR #2777](https://github.com/modular/modular/pull/2777))

* Async and coroutines:

  * [`Coroutine`](/mojo/stdlib/builtin/coroutine/Coroutine) now requires a lifetime parameter. This parameter is set automatically by the parser when calling an async function. It contains the lifetimes of all the arguments and any lifetime accesses by the arguments. This ensures that argument captures by async functions keep the arguments alive as long as the coroutine is alive.

  * Async function calls are no longer allowed to borrow non-trivial register-passable types. Because async functions capture their arguments but register-passable types don't have lifetimes (yet), Mojo is not able to correctly track the reference, making this unsafe. To cover this safety gap, Mojo has temporarily disallowed binding non-trivial register-passable types to borrowed arguments in async functions.

* Miscellaneous:

  * Added an [`InlineArray`](/mojo/stdlib/collections/inline_array/InlineArray) type that works on memory-only types. Compare with the existing [`StaticTuple`](/mojo/stdlib/utils/static_tuple/StaticTuple) type, which is conceptually an array type, but only works on `AnyTrivialRegType`. ([PR #2294](https://github.com/modular/modular/pull/2294))

  * The [`base64`](/mojo/stdlib/base64/) package now includes encoding and decoding support for both the Base64 and Base16 encoding schemes. ([PR #2364](https://github.com/modular/modular/pull/2364)) ([PR #2584](https://github.com/modular/modular/pull/2584))
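    For example, a minimal sketch (assuming the package exposes `b64encode()`/`b64decode()` functions taking and returning `String`):

    ```mojo
    from base64 import b64encode, b64decode

    def main():
        var encoded = b64encode("Mojo")
        print(encoded)             # TW9qbw==
        print(b64decode(encoded))  # Mojo
    ```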
  * The `take()` function in [`Variant`](/mojo/stdlib/utils/variant/Variant) and [`Optional`](/mojo/stdlib/collections/optional/Optional) has been renamed to `unsafe_take()`.

  * The `get()` function in `Variant` has been replaced by `__getitem__()`. That is, `v.get[T]()` should be replaced with `v[T]`.

  * Various functions in the `algorithm` module are now built-in functions. This includes `sort()`, `swap()`, and `partition()`. `swap()` and `partition()` will likely shuffle around as we're reworking our built-in `sort()` function and optimizing it.

  * `infinity` and `NaN` are now correctly handled in [`testing.assert_almost_equal()`](/mojo/stdlib/testing/testing/assert_almost_equal), and an `inf` function has been added to `utils/numerics.mojo`. ([PR #2375](https://github.com/modular/modular/pull/2375))

### Tooling changes

* Invoking `mojo package my-package -o my-dir` on the command line, where `my-package` is a Mojo package source directory, and `my-dir` is an existing directory, now outputs a Mojo package to `my-dir/my-package.mojopkg`. Previously, this had to be spelled out, as in `-o my-dir/my-package.mojopkg`.

* The Mojo Language Server now reports a warning when a local variable is unused.

* Several `mojo` subcommands now support a `--diagnostic-format` option that changes the format with which errors, warnings, and other diagnostics are printed. By specifying `--diagnostic-format json` on the command line, errors and other diagnostics will be output in a structured [JSON Lines](https://jsonlines.org) format that is easier for machines to parse. The full list of subcommands that support `--diagnostic-format` is as follows: `mojo build`, `mojo doc`, `mojo run`, `mojo package`, and `mojo test`. Further, the `mojo test --json` option has been subsumed into this new option; for the same behavior, run `mojo test --diagnostic-format json`. Note that the format of the JSON output may change; we don't currently guarantee its stability across releases of Mojo.

* A new `--validate-doc-strings` option has been added to `mojo` to emit errors on invalid doc strings instead of warnings.

* The `--warn-missing-doc-strings` flag for `mojo` has been renamed to `--diagnose-missing-doc-strings`.

* A new decorator, `@doc_private`, was added that can be used to hide a declaration from being generated in the output of `mojo doc`. It also removes the requirement that the declaration has documentation (for example, when used with `--diagnose-missing-doc-strings`).

* Debugger users can now set breakpoints on function calls in O0 builds even if the call has been inlined by the compiler.

* The Mojo Language Server now supports renaming local variables.

### Other changes

#### ❌ Removed

* The `@unroll` decorator has been deprecated and removed. The decorator was supposed to guarantee that a decorated loop would be unrolled, or else the compiler would error. In practice, this guarantee was eroded over time, as a compiler-based approach cannot be as robust as the Mojo parameter system. In addition, the `@unroll` decorator did not make the loop induction variables parameter values, limiting its usefulness. Please see `@parameter for` for a replacement!

* The method `object.print()` has been removed. Since `object` now conforms to the `Stringable` trait, you can use `print(my_object)` instead.
* The following functions have been removed from the math module:

  * `clamp()`; use the new `SIMD.clamp()` method instead.
  * `round_half_down()` and `round_half_up()`; these can be trivially implemented using the `ceil()` and `floor()` functions.
  * `add()`, `sub()`, `mul()`, `div()`, `mod()`, `greater()`, `greater_equal()`, `less()`, `less_equal()`, `equal()`, `not_equal()`, `logical_and()`, `logical_xor()`, and `logical_not()`; instead, users should rely directly on the corresponding operators (`+`, `-`, `*`, `/`, `%`, `>`, `>=`, `<`, `<=`, `==`, `!=`, `&`, `^`, and `~`).

## v24.3 (2024-05-02)

### Language changes

#### ⭐️ New

* Dunder methods such as `__getattr__()` can now take the member name as a compile-time parameter value instead of a runtime argument, so member access can be validated at compile time. For example:

  ```mojo
  struct RGB:
      fn __getattr__[name: StringLiteral](self) -> Int:
          @parameter
          if name == "r":
              return ...
          elif name == "g":
              return ...
          else:
              constrained[name == "b", "can only access with r, g, or b members"]()
              return ...

  var rgb = RGB()
  print(rgb.b)  # Works
  print(rgb.q)  # Compile error
  ```

* Mojo now allows users to capture the source location of code and call location of functions dynamically using the `__source_location()` and `__call_location()` functions. For example:

  ```mojo
  from builtin._location import __call_location

  @always_inline
  fn my_assert(cond: Bool, msg: String):
      if not cond:
          var call_loc = __call_location()
          print("In", call_loc.file_name, "on line", str(call_loc.line) + ":", msg)

  fn main():
      my_assert(False, "always fails")  # some_file.mojo, line 193
  ```

  This prints "`In /path/to/some_file.mojo on line 193: always fails`". Note that `__call_location()` only works in `@always_inline` or `@always_inline("nodebug")` functions. It gives incorrect results if placed in an `@always_inline` function that's called *from* an `@always_inline("nodebug")` function.

  This feature is still evolving and for the time being you need to explicitly import these APIs, as shown above. In the future, these will probably be built-in functions and not require an import statement.

  Neither `__source_location()` nor `__call_location()` work when called in a parameter context. For example:

  ```mojo
  from builtin._location import __call_location

  @always_inline
  fn mystery_location() -> String:
      var loc = __call_location()
      return str(loc.file_name)

  def main():
      alias doesnt_work = mystery_location()  # error: can't be called in a parameter context
  ```

### Standard library changes

#### ⭐️ New

* [`List`](/mojo/stdlib/collections/list/List) has several new methods:

  * `pop(index)` for removing an element at a particular index. By default, `List.pop()` removes the last element in the list. (@LJ-9801, fixes [#2017](https://github.com/modular/modular/issues/2017))

  * `resize(new_size)` for resizing the list without the need to specify an additional value. ([@mikowals](https://github.com/mikowals), fixes [#2133](https://github.com/modular/modular/issues/2133))

  * `insert(index, value)` for inserting a value at a specified index into the `List`. ([@whym1here](https://github.com/whym1here), fixes [#2134](https://github.com/modular/modular/issues/2134))

  * A new constructor `List(ptr, size, capacity)` to avoid needing to do a deep copy of an existing contiguous memory allocation when constructing a new `List`. ([@StandinKP](https://github.com/StandinKP), fixes [#2170](https://github.com/modular/modular/issues/2170))

* [`Dict`](/mojo/stdlib/collections/dict/Dict) now has an `update()` method to update keys/values from another `Dict`. ([@gabrieldemarmiesse](https://github.com/gabrieldemarmiesse))
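  A small sketch of the new method (the keys and values are hypothetical, for illustration only):

  ```mojo
  def main():
      var a = Dict[String, Int]()
      a["x"] = 1
      var b = Dict[String, Int]()
      b["x"] = 10
      b["y"] = 2
      # Merge b into a (Python-style update semantics are assumed here).
      a.update(b)
      print(a["x"], a["y"])  # 10 2
  ```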
* [`Set`](/mojo/stdlib/collections/set/Set) now has named methods for set operations:

  * `difference()` mapping to `-`
  * `difference_update()` mapping to `-=`
  * `intersection_update()` mapping to `&=`
  * `update()` mapping to `|=`

  ([@arvindavoudi](https://github.com/arvindavoudi))

* `Dict`, `List`, and `Set` all conform to the `Boolable` trait. The collections evaluate to `True` if they contain any elements, `False` otherwise:

  ```mojo
  def list_names(names: List[String]):
      if names:
          for name in names:
              print(name[])
      else:
          print("No names to list.")
  ```

  ([@gabrieldemarmiesse](https://github.com/gabrieldemarmiesse))

* Added the [`reversed()`](/mojo/stdlib/builtin/reversed/reversed) function for creating reversed iterators. Several range types, `List`, and `Dict` now support iterating in reverse.

  ```mojo
  var numbers = List(1, 2, 3, 4, 5)
  for number in reversed(numbers):
      print(number)
  ```

  ([@helehex](https://github.com/helehex) and [@jayzhan211](https://github.com/jayzhan211), contributes towards [#2325](https://github.com/modular/modular/issues/2325))

* [`Optional`](/mojo/stdlib/collections/optional/Optional) now implements the `__is__` and `__isnot__` methods so that you can compare an `Optional` with `None`. For example:

  ```mojo
  var opt = Optional(1)
  if opt is not None:
      print(opt.value()[])
  ```

  ([@gabrieldemarmiesse](https://github.com/gabrieldemarmiesse))

* [`Tuple`](/mojo/stdlib/builtin/tuple/Tuple) now works with memory-only element types like `String` and allows you to directly index into it with a parameter expression. This means you can now simply use `x = tup[1]` like Python instead of `x = tup.get[1, Int]()`. You can also assign into tuple elements now as well with `tup[1] = x`.

  ```mojo
  var tuple = ("Green", 9.3)
  var name = tuple[0]
  var value = tuple[1]
  ```

  Note that because the subscript must be a parameter expression, you can't iterate through a `Tuple` using an ordinary `for` loop.

* The `Reference` type has several changes, including:

  * It has moved to the `memory.reference` module instead of `memory.unsafe`.
  * `Reference` now has an [`unsafe_bitcast()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#bitcast) method, similar to the pointer types.
  * Several unsafe methods were removed, including `offset()`, `destroy_element_unsafe()` and `emplace_ref_unsafe()`. This is because `Reference` is a safe type; use `UnsafePointer` to do unsafe operations.

* [`Bool`](/mojo/stdlib/builtin/bool/Bool) can now be implicitly converted from any type conforming to the [`Boolable`](/mojo/stdlib/builtin/bool/Boolable) trait. This means that you no longer need to write code like this:

  ```mojo
  @value
  struct MyBoolable:
      fn __bool__(self) -> Bool: ...

  fn takes_boolable[T: Boolable](cond: T): ...

  takes_boolable(MyBoolable())
  ```

  Instead, you can simply write:

  ```mojo
  fn takes_bool(cond: Bool): ...

  takes_bool(MyBoolable())
  ```

  Note that calls to `takes_bool()` will perform the implicit conversion, so in some cases it is still better to explicitly declare a type parameter, e.g.:

  ```mojo
  fn takes_two_boolables[T: Boolable](a: T, b: T):
      # Short circuit means `b.__bool__()` might not be evaluated.
      if a.__bool__() and b.__bool__():
          ...
  ```

* [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) now conforms to the [`KeyElement`](/mojo/stdlib/collections/dict/KeyElement) trait, meaning that it can be used as key type for [`Dict`](/mojo/stdlib/collections/dict/Dict).
  This allows you to easily build and interact with Python dictionaries in Mojo:

  ```mojo
  def main():
      d = PythonObject(Dict[PythonObject, PythonObject]())
      d["foo"] = 12
      d[7] = "bar"
      d["foo"] = [1, 2, "something else"]
      print(d)  # prints `{'foo': [1, 2, 'something else'], 7: 'bar'}`
  ```

* [`FileHandle.seek()`](/mojo/stdlib/builtin/file/FileHandle#seek) now has a `whence` argument that defaults to `os.SEEK_SET` to seek from the beginning of the file. You can now set it to `os.SEEK_CUR` to offset by the current `FileHandle` seek position:

  ```mojo
  var f = open("/tmp/example.txt")
  # Skip 32 bytes
  f.seek(os.SEEK_CUR, 32)
  ```

  Or `os.SEEK_END` to offset from the end of the file:

  ```mojo
  # Start from 32 bytes before the end of the file
  f.seek(os.SEEK_END, -32)
  ```

* [`FileHandle.read()`](/mojo/stdlib/builtin/file/FileHandle#read) can now read straight into a [`DTypePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer):

  ```mojo
  var file = open("/tmp/example.txt", "r")

  # Allocate and load 8 elements
  var ptr = DTypePointer[DType.float32].alloc(8)
  var bytes = file.read(ptr, 8)
  print("bytes read", bytes)
  print(ptr.load[width=8]())
  ```

* The `sys` module now contains an `exit()` function that exits a Mojo program with the specified error code.

  ```mojo
  from sys import exit

  exit(0)
  ```

* The constructors for [`Tensor`](/max/api/mojo/tensor/tensor/Tensor) have been changed to be more consistent. As a result, constructors take the shape as the first argument (instead of the second) when constructing a tensor with pointer data. If you pass a single scalar value to the `Tensor` constructor, it now broadcasts the value to all elements in the tensor. For example, `Tensor[DType.float32](TensorShape(2,2), 0)` constructs a `2x2` tensor initialized with all zeros. This provides an easy way to fill in the data of a tensor.

* [`String`](/mojo/stdlib/collections/string/string/String) now has `removeprefix()` and `removesuffix()` methods. ([@gabrieldemarmiesse](https://github.com/gabrieldemarmiesse))

* The [`ord`](/mojo/stdlib/collections/string/string/ord) and [`chr`](/mojo/stdlib/collections/string/string/chr) functions have been improved to accept any Unicode character. ([@mzaks](https://github.com/mzaks), contributes towards [#1616](https://github.com/modular/modular/issues/1616))

* [`atol()`](/mojo/stdlib/collections/string/string/atol) now handles whitespace. The `atol()` function is used internally by `String.__int__()`, so `int(String(" 10 "))` now returns `10` instead of raising an error. ([@artemiogr97](https://github.com/artemiogr97))

* [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) now implements the `__rmod__()` method. ([@bgreni](https://github.com/bgreni), fixes [#1482](https://github.com/modular/modular/issues/1482))

* [`bool(None)`](/mojo/stdlib/builtin/bool/bool-function) is now implemented. ([@zhoujingya](https://github.com/zhoujingya))

* The [`DTypePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) type now implements `gather()` for gathering a `SIMD` vector from offsets of a current pointer. Similarly, support for `scatter()` was added to scatter a `SIMD` vector into offsets of the current pointer. ([@leandrolcampos](https://github.com/leandrolcampos))

* The [`len()`](/mojo/stdlib/builtin/len/len) function now handles a [`range()`](/mojo/stdlib/builtin/range/range) specified with a negative end value, so that things like `len(range(-1))` work correctly.
  ([@soraros](https://github.com/soraros))

* [`debug_assert()`](/mojo/stdlib/builtin/debug_assert/debug_assert) now prints its location (filename, line, and column where it was called) in its error message. Similarly, the `assert` helpers in the [`testing`](/mojo/stdlib/testing/testing/) module now include location information in their messages.

* The [`testing.assert_equal[SIMD]()`](/mojo/stdlib/testing/testing/assert_equal) function now raises if any of the elements mismatch in the two `SIMD` arguments being compared. ([@gabrieldemarmiesse](https://github.com/gabrieldemarmiesse))

* The [`testing.assert_almost_equal()`](/mojo/stdlib/testing/testing/assert_almost_equal) and [`math.isclose()`](/mojo/stdlib/math/math/isclose) functions now have an `equal_nan` flag. When set to `True`, NaNs are considered equal.

* The `object` type now supports the division, modulo, and left and right shift operators, including the in-place and reverse variants. (@LJ-9801, fixes [#2224](https://github.com/modular/modular/issues/2224))

* Added checked arithmetic operations for `SIMD` integers. `SIMD` integer types (including the sized integer scalars like `Int64`) can now perform checked additions, subtractions, and multiplications using the following new methods:

  * `add_with_overflow()`
  * `sub_with_overflow()`
  * `mul_with_overflow()`

  Checked arithmetic allows the caller to determine if an operation exceeded the numeric limits of the type. For example:

  ```mojo
  var simd = SIMD[DType.int8, 4](7, 11, 13, 17)
  var product: SIMD[DType.int8, 4]
  var overflow: SIMD[DType.bool, 4]
  (product, overflow) = simd.mul_with_overflow(simd)
  for i in range(len(product)):
      if overflow[i]:
          print("overflow")
      else:
          print(product[i])
  ```

  ([@lsh](https://github.com/lsh))

* Added [`os.remove()`](/mojo/stdlib/os/os/remove) and [`os.unlink()`](/mojo/stdlib/os/os/unlink) for deleting files. ([@artemiogr97](https://github.com/artemiogr97), fixes [#2306](https://github.com/modular/modular/issues/2306))

#### 🦋 Changed

* The [`parallel_memcpy()`](/mojo/stdlib/algorithm/memory/parallel_memcpy) function has moved from the `buffer` package to the `algorithm` package. Please update your imports accordingly.

* [`Optional.value()`](/mojo/stdlib/collections/optional/Optional#value) now returns a reference instead of a copy of the contained value. To perform a copy manually, dereference the result:

  ```mojo
  var result = Optional(123)
  var value = result.value()[]
  ```

  ([@lsh](https://github.com/lsh), fixes [#2179](https://github.com/modular/modular/issues/2179))

* Per the accepted community proposal, [Standardize the representation of byte sequence as a sequence of unsigned 8-bit integers](https://github.com/modular/modular/blob/main/mojo/proposals/byte-as-uint8.md), began the transition to using `UInt8` by changing the data pointer of `Error` to `DTypePointer[DType.uint8]`. ([@gabrieldemarmiesse](https://github.com/gabrieldemarmiesse), contributes towards [#2317](https://github.com/modular/modular/issues/2317))

* Continued the transition to `UnsafePointer` from the legacy `Pointer` type in various standard library APIs and internals. ([@gabrieldemarmiesse](https://github.com/gabrieldemarmiesse))

### Tooling changes

* The behavior of `mojo build` when invoked without an output `-o` argument has changed slightly: `mojo build ./test-dir/program.mojo` now outputs an executable to the path `./program`, whereas before it would output to the path `./test-dir/program`.

* The `mojo package` command no longer supports the `-D` flag.
  All compilation environment flags should be provided at the point of package use (e.g. `mojo run` or `mojo build`).

* The REPL no longer allows top-level variable declarations to be uninitialized; for example, it will reject `var s: String`. This is because it does not do proper lifetime tracking (yet!) across cells, and so such code would lead to a crash. You can work around this by initializing to a dummy value and overwriting later. This limitation only applies to top-level variables; variables in functions work as they always have.

### Other changes

#### Low-level language changes

* A low-level `__get_mvalue_as_litref(x)` builtin was added to give access to the underlying memory representation as a `!lit.ref` value without checking the initialization status of the underlying value. This is useful in very low-level logic, but isn't designed for general usability and will likely change in the future.

* Properties can now be specified on inline MLIR ops:

  ```mojo
  _ = __mlir_op.`kgen.source_loc`[
      _type = (
          __mlir_type.index, __mlir_type.index, __mlir_type.`!kgen.string`
      ),
      _properties = __mlir_attr.`{inlineCount = 1 : i64}`,
  ]()
  ```

  As the example above shows, the protected `_properties` attribute can be passed during op construction, with an MLIR `DictionaryAttr` value.

#### ❌ Removed

* Support for "register only" variadic packs has been removed. Instead of `AnyRegType`, please upgrade your code to `AnyType` in examples like this:

  ```mojo
  fn your_function[*Types: AnyRegType](*args: *Types): ...
  ```

  This move gives you access to a nicer API and has the benefit of being memory safe and correct for non-trivial types. If you need specific APIs on the types, please use the correct trait instead of `AnyType`.

* `List.pop_back()` has been removed. Use `List.pop()` instead, which defaults to popping the last element in the list.

* `SIMD.to_int(value)` has been removed. Use `int(value)` instead.

* The `__get_lvalue_as_address(x)` magic function has been removed. To get a reference to a value, use `Reference(x)`, and if you need an unsafe pointer, you can use `UnsafePointer.address_of(x)`.

#### 🛠️ Fixed

* [#516](https://github.com/modular/modular/issues/516) and [#1817](https://github.com/modular/modular/issues/1817) and many others, e.g. "Can't create a function that returns two strings."
* [#1178](https://github.com/modular/modular/issues/1178) (os/kern) failure (5).
* [#1609](https://github.com/modular/modular/issues/1609) alias with `DynamicVector[Tuple[Int]]` fails.
* [#1987](https://github.com/modular/modular/issues/1987) Defining `main` in a Mojo package is an error, for now. This is not intended to work yet; erroring for now will help to prevent accidental undefined behavior.
* [#1215](https://github.com/modular/modular/issues/1215) and [#1949](https://github.com/modular/modular/issues/1949) The Mojo LSP server no longer cuts off hover previews for functions with functional arguments, parameters, or results.
* [#1901](https://github.com/modular/modular/issues/1901) Fixed Mojo LSP and documentation generation handling of inout arguments.
* [#1913](https://github.com/modular/modular/issues/1913) - `0__` no longer crashes the Mojo parser.
* [#1924](https://github.com/modular/modular/issues/1924) JIT debugging on Mac has been fixed.
* [#1941](https://github.com/modular/modular/issues/1941) Mojo variadic arguments don't work with non-trivial register-only types.
* [#1963](https://github.com/modular/modular/issues/1963) `a!=0` is now parsed and formatted correctly by `mojo format`.
* [#1676](https://github.com/modular/modular/issues/1676) Fix a crash related to `@value` decorator and structs with empty body. * [#1917](https://github.com/modular/modular/issues/1917) Fix a crash after syntax error during tuple creation. * [#2006](https://github.com/modular/modular/issues/2006) The Mojo LSP now properly supports signature types with named arguments and parameters. * [#2007](https://github.com/modular/modular/issues/2007) and [#1997](https://github.com/modular/modular/issues/1997) The Mojo LSP no longer crashes on certain types of closures. * [#1675](https://github.com/modular/modular/issues/1675) Ensure `@value` decorator fails gracefully after duplicate field error. * [#2068](https://github.com/modular/modular/issues/2068) Fix `SIMD.reduce()` for size\_out == 2. ([@soraros](https://github.com/soraros)) ## v24.2.1 (2024-04-11) This release doesn't include any changes to Mojo. ## v24.2 (2024-03-28) ### 🔥 Legendary * The Mojo standard library is now open source! Check out the [README](https://github.com/modular/modular/blob/main/mojo/stdlib/README.md) for everything you need to get started. * Structs and other nominal types are now allowed to implicitly conform to traits. A struct implicitly conforms to a trait if it implements all the requirements for the trait. For example, any struct that implements the `__str__()` method implicitly conforms to `Stringable`, and is usable with the `str()` built-in function. ```mojo @value struct Foo: fn __str__(self) -> String: return "foo!" fn main(): print(str(Foo())) # prints 'foo!' ``` We still strongly encourage you to explicitly list the traits a struct conforms to when possible: ```mojo @value struct Foo(Stringable): ... ``` Not only is this useful for documentation and for communicating intentions, but in the future, explicit conformance will be useful for features like default methods and extensions. * Mojo's Python interoperability now supports passing keyword arguments to Python functions: ```mojo from python import Python def main(): plt = Python.import_module("matplotlib.pyplot") plt.plot((5, 10), (10, 15), color="red") plt.show() ``` ### Language changes #### ⭐️ New * Mojo now has support for variadic keyword arguments, often referred to as `**kwargs`. This means you can now declare and call functions like this: ```mojo fn print_nicely(**kwargs: Int) raises: for key in kwargs.keys(): print(key[], "=", kwargs[key[]]) # prints: # `a = 7` # `y = 8` print_nicely(a=7, y=8) ``` For more details (and a list of current limitations), see [Variadic keyword arguments](/mojo/manual/functions#variadic-keyword-arguments) in the Mojo manual. #### 🦋 Changed or removed * `let` declarations now produce a compile time error instead of a warning, our next step in [removing let declarations](https://github.com/modular/modular/blob/main/mojo/proposals/remove-let-decls.md). The compiler still recognizes the `let` keyword for now in order to produce a good error message, but that will be removed in subsequent releases. * Mojo now warns about unused values in both `def` and `fn` declarations, instead of completely disabling the warning in `def`s. It never warns about unused `object` or `PythonObject` values, tying the warning to these types instead of the kind of function they are unused in. This will help catch API usage bugs in `def`s and make imported Python APIs more ergonomic in `fn`s. * For the time being, dynamic type values will be disabled in the language. 
For example, the following will now fail with an error: ```mojo var t = Int # dynamic type values not allowed struct SomeType: ... takes_type(SomeType) # dynamic type values not allowed ``` We want to take a step back and (re)design type-valued variables, existentials, and other dynamic features. This does not affect type-valued **parameters**, so the following works as before: ```mojo alias t = Int # still 🔥 struct SomeType: ... takes_type[SomeType]() # already 🔥 fn uses_trait[T: SomeTrait](value: T): ... # still 🔥 ``` * The `*_` expression in parameter expressions is now required to occur at the end of a positional parameter list, instead of being allowed in the middle. ```mojo # No longer supported alias FirstUnbound = SomeStruct[*_, 42] alias MidUnbound = SomeStruct[7, *_, 6] # Still supported alias LastUnbound = SomeStruct[42, *_] ``` We narrowed this because we want to encourage type designers to get the order of parameters right, and want to extend `*_` to support keyword parameters as well in the future. ### Standard library changes #### ⭐️ New * `DynamicVector` has been renamed to [`List`](/mojo/stdlib/collections/list/List), and has moved from the `collections.vector` module to the `collections.list` module. In addition: * You can now construct a `List` from a variadic number of values. For example: ```mojo var numbers = List[Int](1, 2, 3) ``` * `List` and [`InlinedFixedVector`](/mojo/stdlib/collections/inline_array/InlineArray) types now support negative indexing. This means that you can write `vec[-1]`, which is equivalent to `vec[len(vec)-1]`. * `List.push_back()` has been removed. Please use the `append()` function instead. * The [`print()`](/mojo/stdlib/builtin/io/print) function now takes `sep` and `end` keyword arguments. This means that you can write: ```mojo print("Hello", "Mojo", sep=", ", end="!!!\n") # prints Hello, Mojo!!! ``` `sep` defaults to the empty string and `end` defaults to "\n". Also, the `print_no_newline()` function has been removed. Please use `print(end="")` instead. * The [`FloatLiteral`](/mojo/stdlib/builtin/float_literal/FloatLiteral) type is now an infinite-precision nonmaterializable type. This means you can do compile-time calculations using `FloatLiteral` without rounding errors. When materialized at runtime, a `FloatLiteral` value is converted to a [`Float64`](/mojo/stdlib/builtin/simd). ```mojo # third is an infinite-precision FloatLiteral value alias third = 1.0 / 3.0 # t is a Float64 var t = third ``` * String types all conform to the [`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising) trait. This means that you can now call `int("123")` to get the integer `123`. If the integer cannot be parsed from the string, then an error is raised. * The `Tensor` type now has `argmax()` and `argmin()` functions to compute the position of the max or min value. Note: this should return a `Tensor[Int]` but currently the output tensor is the same type as the input tensor. This will be fixed in a future release. * Added a new [`collections.OptionalReg`](/mojo/stdlib/collections/optional/OptionalReg) type, a register-passable alternative to [`Optional`](/mojo/stdlib/collections/optional/Optional). * The [`ulp()`](/mojo/stdlib/utils/numerics/ulp) function has been added to the `math` module. This allows you to get the units of least precision (or units of last place) of a floating point value.
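  For example, a minimal usage sketch (the `Float64` input value here is an illustrative assumption, not part of the changelog entry):

  ```mojo
  from math import ulp

  fn main():
      # Prints the gap between 1.0 and the next representable Float64.
      print(ulp(Float64(1.0)))
  ```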
#### 🦋 Changed * The `simd_load()`, `simd_store()`, `aligned_simd_load()`, and `aligned_simd_store()` methods on [`DTypePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer), [`Buffer`](/mojo/stdlib/buffer/buffer/NDBuffer), and [`NDBuffer`](/mojo/stdlib/buffer/buffer/NDBuffer) have been merged into a more expressive set of `load()` and `store()` methods with keyword-only `width` and `alignment` parameters: ```mojo # Doesn't work my_simd = my_buffer.simd_load[simd_width](index) # Works my_simd = my_buffer.load[width=simd_width](index) # Doesn't work my_buffer.aligned_simd_store[width, alignment](my_simd) # Works my_buffer.store[width=width, alignment=alignment](my_simd) ``` * The [`EqualityComparable`](/mojo/stdlib/builtin/equality_comparable/EqualityComparable) trait now requires the `__ne__()` method for conformance in addition to the previously required `__eq__()` method. * Many types now declare conformance to the `EqualityComparable` trait. * [`StaticTuple`](/mojo/stdlib/utils/static_tuple/StaticTuple) parameter order has changed to `StaticTuple[type, size]` for consistency with `SIMD` and similar collection types. * The signature of the [`elementwise()`](/mojo/stdlib/algorithm/functional/elementwise) function has been changed. The new order is `function`, `simd_width`, and then `rank`. As a result, the rank parameter can now be inferred and one can call `elementwise()` without it: ```mojo elementwise[func, simd_width](shape) ``` * `PythonObject` is now register-passable. * `PythonObject.__iter__()` now works correctly on more types of iterable Python objects. Attempting to iterate over non-iterable objects will now raise an exception instead of behaving as if iterating over an empty sequence. `__iter__()` also now borrows `self` rather than requiring `inout`, allowing code like: ```mojo for value in my_dict.values(): ... ``` #### 🚚 Moved * We took the opportunity to rehome some modules into their correct package as we were going through the process of open-sourcing the Mojo standard library. Specifically, the following are some breaking changes worth calling out. Please update your import statements accordingly. * [`Buffer`](/mojo/stdlib/buffer/buffer/NDBuffer), [`NDBuffer`](/mojo/stdlib/buffer/buffer/NDBuffer), and friends have moved from the `memory` package into a new `buffer` package. ```mojo from buffer import Buffer, NDBuffer ``` * `utils.list`, including the [`Dim`](/mojo/stdlib/buffer/dimlist/Dim) and [`DimList`](/mojo/stdlib/buffer/dimlist/DimList) types, has moved to the `buffer` package. ```mojo from buffer import Dim, DimList ``` * The [`parallel_memcpy()`](/mojo/stdlib/algorithm/memory/parallel_memcpy) function has moved from the `memory` package into the `buffer` package. ```mojo from buffer import parallel_memcpy ``` * The [`rand()`](/max/api/mojo/tensor/tensor/Tensor/#rand) and [`randn()`](/max/api/mojo/tensor/tensor/Tensor/#randn) functions from the `random` package that return a `Tensor` have moved to the `tensor` package. Note that the overloads that write to a `DTypePointer` remain in the `random` package. If you happen to be using both versions in the same source file, you can import them both using the `import as` syntax: ```mojo from tensor import rand from random import rand as rand_dt ``` * The `trap()` function has been renamed to [`abort()`](/mojo/stdlib/os/os/abort). It also has moved from the `debug` module to the `os` module.
```mojo from os import abort ``` * The [`isinf()`](/mojo/stdlib/utils/numerics/isfinite) and [`isfinite()`](/mojo/stdlib/utils/numerics/isfinite) methods have been moved from `math.limits` to the `math` module. ```mojo from math import isinf, isfinite ``` ### Tooling changes #### ⭐️ New * Docstring code blocks can now use `%#` to hide lines of code from documentation generation. For example: ```mojo var value = 5 %# print(value) ``` Will generate documentation of the form: ```mojo var value = 5 ``` Hidden lines are processed as if they were normal code lines during test execution. This allows for writing additional code within a docstring example that is only used to ensure the example is runnable/testable. * The Mojo LSP server now allows you to specify additional search paths to use when resolving imported modules in a document. You can specify search paths on the command line, using the `-I` option, or you can add them to the `mojo.lsp.includeDirs` setting in the VS Code extension. ### Other changes #### ❌ Removed * The `__get_address_as_lvalue` magic function has been removed. You can now get an LValue from a `Pointer` or `Reference` by using the dereference operator (`[]`): ```mojo var ptr: Pointer[MyRecord] ... # Doesn't work __get_address_as_lvalue(ptr.value) = MyRecord(3, 5) # Works ptr[] = MyRecord(3, 5) ``` * The type parameter for the `memcpy` function is now automatically inferred. This means that calls to `memcpy` of the form `memcpy[DType.xyz](...)` will no longer work and the user would have to change the code to `memcpy(...)`. * The [`memcpy()`](/mojo/stdlib/memory/memory/memcpy) overload that worked on [`Buffer`](/mojo/stdlib/buffer/buffer/NDBuffer) types has been removed in favor of just overloads for [`Pointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) and [`DTypePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer): ```mojo # Doesn't work memcpy(destBuffer, srcBuffer, count) # Works memcpy(destBuffer.data, srcBuffer.data, count) ``` * The `max_or_inf()` and `min_or_neginf()` functions have been removed from `math.limit`. These functions were only used by the SIMD type. * As mentioned previously, the `print_no_newline()` function has been removed. Please use `print(end="")` instead. #### 🛠️ Fixed * [#1362](https://github.com/modular/modular/issues/1362) - Parameter inference now recursively matches function types. * [#951](https://github.com/modular/modular/issues/951) - Functions that were both `async` and `@always_inline` incorrectly errored. * [#1858](https://github.com/modular/modular/issues/1858) - Trait with parametric methods regression. * [#1892](https://github.com/modular/modular/issues/1892) - Forbid unsupported decorators on traits. * [#1735](https://github.com/modular/modular/issues/1735) - Trait-typed values are incorrectly considered equal. * [#1909](https://github.com/modular/modular/issues/1909) - Crash due to nested import in unreachable block. * [#1921](https://github.com/modular/modular/issues/1921) - Parser crashes binding `Reference` to lvalue with subtype lifetime. * [#1945](https://github.com/modular/modular/issues/1945) - `Optional[T].or_else()` should return `T` instead of `Optional[T]`. * [#1940](https://github.com/modular/modular/issues/1940) - Constrain `math.copysign` to floating point or integral types.
* [#1838](https://github.com/modular/modular/issues/1838) - Variadic `print` does not work when specifying `end=""` * [#1826](https://github.com/modular/modular/issues/1826) - The `SIMD.reduce` methods correctly handle edge cases where `size_out >= size`. ## v24.1.1 (2024-03-18) This release includes installer improvements and enhanced error reporting for installation issues. Otherwise it is functionally identical to Mojo 24.1. ## v24.1 (2024-02-29) ### 🔥 Legendary * Mojo is now bundled with [the MAX platform](/max)! As such, the Mojo package version now matches the MAX version, which follows a `YY.MAJOR.MINOR` version scheme. Because this is our first release in 2024, that makes this version `24.1`. * Mojo debugging support is here! The Mojo VS Code extension includes debugger support. For details, see [Debugging](/mojo/tools/debugging) in the Mojo Manual. ### ⭐️ New * We now have a [`Set`](/mojo/stdlib/collections/set/Set) type in our collections! `Set` is backed by a `Dict`, so it has fast add, remove, and `in` checks, and requires member elements to conform to the `KeyElement` trait. ```mojo from collections import Set var set = Set[Int](1, 2, 3) print(len(set)) # 3 set.add(4) for element in set: print(element[]) set -= Set[Int](3, 4, 5) print(set == Set[Int](1, 2)) # True print(set | Set[Int](0, 1) == Set[Int](0, 1, 2)) # True let element = set.pop() print(len(set)) # 1 ``` * Mojo now supports the `x in y` expression as syntax sugar for `y.__contains__(x)` as well as `x not in y`. * Mojo now has support for keyword-only arguments and parameters. For example: ```mojo fn my_product(a: Int, b: Int = 1, *, c: Int, d: Int = 2): print(a * b * c * d) my_product(3, c=5) # prints '30' my_product(3, 5, d=7) # error: missing 1 required keyword-only argument: 'c' ``` This includes support for declaring signatures that use both variadic and keyword-only arguments/parameters. For example, the following is now possible: ```mojo fn prod_with_offset(*args: Int, offset: Int = 0) -> Int: var res = 1 for i in range(len(args)): res *= args[i] return res + offset print(prod_with_offset(2, 3, 4, 10)) # prints 240 print(prod_with_offset(2, 3, 4, offset=10)) # prints 34 ``` Note that variadic keyword-only arguments/parameters (for example, `**kwargs`) are not supported yet. That is, the following is not allowed: ```mojo fn variadic_kw_only(a: Int, **kwargs): ... ``` For more information, see [Positional-only and keyword-only arguments](/mojo/manual/functions#positional-only-and-keyword-only-arguments) in the Mojo Manual. * The `print()` function now accepts a keyword-only `end` argument, which is useful for controlling whether a newline is printed after the elements. `end` defaults to `"\n"` as before. * The Mojo SDK can now be installed on AWS Graviton instances. * A new version of the [Mojo Playground](https://developer.modular.com/playground) is available. The new playground is a simple interactive editor for Mojo code, similar to the Rust Playground or Go Playground. The old JupyterLab-based playground will remain online until March 20th. * The Mojo LSP server will now generate fixits for populating empty documentation strings: ```mojo fn foo(arg: Int): """""" # Unexpected empty documentation string ``` Applying the fixit from above will generate: ```mojo fn foo(arg: Int): """[summary]. Args: arg: [description]. """ ``` * Added new `*_` syntax that allows users to explicitly unbind any number of positional parameters.
For example: ```mojo struct StructWithDefault[a: Int, b: Int, c: Int = 8, d: Int = 9]: pass alias all_unbound = StructWithDefault[*_] # equivalent to alias all_unbound = StructWithDefault[_, _, _, _] alias first_bound = StructWithDefault[5, *_] # equivalent to alias first_bound = StructWithDefault[5, _, _, _] alias last_bound = StructWithDefault[*_, 6] # equivalent to alias last_bound = StructWithDefault[_, _, _, 6] alias mid_unbound = StructWithDefault[3, *_, 4] # equivalent to alias mid_unbound = StructWithDefault[3, _, _, 4] ``` As demonstrated above, this syntax can be used to explicitly unbind an arbitrary number of parameters, at the beginning, at the end, or in the middle of the operand list. Since these unbound parameters must be explicitly specified at some point, default values for these parameters are not applied. For example: ```mojo alias last_bound = StructWithDefault[*_, 6] # When using last_bound, you must specify a, b, and c; last_bound doesn't have a default value for `c`. var s = last_bound[1, 2, 3]() ``` For more information see the Mojo Manual sections on [partially-bound types](/mojo/manual/parameters/#fully-bound-partially-bound-and-unbound-types) and [automatic parameterization of functions](/mojo/manual/parameters/#automatic-parameterization-of-functions). * [`DynamicVector`](/mojo/stdlib/collections/list/List) now supports iteration. Iteration values are instances of `Reference` and require dereferencing: ```mojo var v = DynamicVector[String]() v.append("Alice") v.append("Bob") v.append("Charlie") for x in v: x[] = str("Hello, ") + x[] for x in v: print(x[]) ``` * `DynamicVector` now has [`reverse()`](/mojo/stdlib/collections/list/List#reverse) and [`extend()`](/mojo/stdlib/collections/list/List#extend) methods. * The `mojo package` command now produces compilation-agnostic packages. Compilation options such as O0 or --debug-level are no longer needed or accepted. As a result, packages are now smaller and extremely portable. * Initializers for `@register_passable` values can (and should!) now be specified with `inout self` arguments just like memory-only types: ```mojo @register_passable struct YourPair: var a: Int var b: Int fn __init__(inout self): self.a = 42 self.b = 17 fn __copyinit__(inout self, existing: Self): self.a = existing.a self.b = existing.b ``` This form makes the language more consistent, more similar to Python, and easier to implement advanced features for. There is also no performance impact of using this new form: the compiler arranges to automatically return the value in a register without requiring you to worry about it. The older `-> Self` syntax is still supported in this release, but will be removed in a subsequent one, so please migrate your code. One thing to watch out for: a given struct should use one style or the other; mixing some of each won't work well. * The `inout self` initializer form is **required** for initializers of `@register_passable` types that may raise errors: ```mojo @register_passable struct RaisingCtor: fn __init__(inout self) raises: raise ``` * `async` functions that may raise errors have been temporarily disabled in this build. The implementation of Mojo async is undergoing a rework 🚧. * The standard library `slice` type has been renamed to [`Slice`](/mojo/stdlib/builtin/builtin_slice/Slice), and a `slice` function has been introduced. This makes Mojo closer to Python and makes the `Slice` type follow the naming conventions of other types like `Int`.
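  As a hedged illustration of the new `slice` function (this assumes `String` subscripting accepts a `Slice` value, which the changelog entry itself doesn't show):

  ```mojo
  fn main():
      var hello = String("hello world")
      # slice(0, 5) constructs a Slice value, like Python's slice() builtin.
      print(hello[slice(0, 5)])  # prints 'hello'
  ```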
* "Slice" syntax in subscripts is no longer hard coded to the builtin `slice` type: it now works with any type accepted by a container's `__getitem__()` method. For example: ```mojo @value struct UnusualSlice: var a: Int var b: Float64 var c: String struct YourContainer: fn __getitem__(self, slice: UnusualSlice) -> T: ... ``` Given this implementation, you can subscript into an instance of `YourContainer` like `yc[42:3.14:"🔥"]` and the three values are passed to the `UnusualSlice` constructor. * The `__refitem__()` accessor method may now return a `Reference` instead of having to return an MLIR internal reference type. * Added [`AnyPointer.move_into()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#move_pointee_into) method, for moving a value from one pointer memory location to another. * Added built-in [`hex()`](/mojo/stdlib/builtin/format_int/hex) function, which can be used to format any value whose type implements the [`Intable`](/mojo/stdlib/builtin/int/Intable) trait as a hexadecimal string. * [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) now implements `__is__` and `__isnot__` so that you can use expressions of the form `x is y` and `x is not y` with `PythonObject`. * [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) now conforms to the `SizedRaising` trait. This means the built-in [`len()`](/mojo/stdlib/builtin/len/len) function now works on `PythonObject`. * The `os` package now contains the [`stat()`](/mojo/stdlib/os/fstat/stat) and [`lstat()`](/mojo/stdlib/os/fstat/lstat) functions. * A new [`os.path`](/mojo/stdlib/os/path/path) package now allows you to query properties on paths. * The `os` package now has a [`PathLike`](/mojo/stdlib/os/pathlike/PathLike) trait. A struct conforms to the `PathLike` trait by implementing the `__fspath__()` function. * The [`pathlib.Path`](/mojo/stdlib/pathlib/path/Path) now has functions to query properties of the path. * The [`listdir()`](/mojo/stdlib/pathlib/path/Path#listdir) method now exists on [`pathlib.Path`](/mojo/stdlib/pathlib/path) and also exists in the `os` module to work on `PathLike` structs. For example, the following sample lists all the directories in the `/tmp` directory: ```mojo from pathlib import Path fn walktree(top: Path, inout files: DynamicVector[Path]): try: var ls = top.listdir() for i in range(len(ls)): var child = top / ls[i] if child.is_dir(): walktree(child, files) elif child.is_file(): files.append(child) else: print("Skipping '" + str(child) + "'") except: return fn main(): var files = DynamicVector[Path]() walktree(Path("/tmp"), files) for i in range(len(files)): print(files[i]) ``` * The [`find()`](/mojo/stdlib/builtin/string_literal/StringLiteral#find), [`rfind()`](/mojo/stdlib/builtin/string_literal/StringLiteral#rfind), [`count()`](/mojo/stdlib/collections/string/string_slice/StringSlice#count), and [`__contains__()`](/mojo/stdlib/builtin/string_literal/StringLiteral#__contains__) methods now work on string literals. This means that you can write: ```mojo if "Mojo" in "Hello Mojo": ... ``` * Breakpoints can now be inserted programmatically within the code using the builtin [`breakpoint()`](/mojo/stdlib/builtin/breakpoint/breakpoint) function. Note: on Graviton instances, the debugger might not be able to resume after hitting this kind of breakpoint. * Added a builtin [`Boolable`](/mojo/stdlib/builtin/bool/Boolable) trait that describes a type that can be represented as a boolean value. To conform to the trait, a type must implement the `__bool__()` method. 
* Modules within packages can now use purely relative `from` imports: ```mojo from . import another_module ``` * Trivial types, like MLIR types and function types, can now be bound implicitly to traits that require copy constructors or move constructors, such as [`Movable`](/mojo/stdlib/builtin/value/Movable), [`Copyable`](/mojo/stdlib/builtin/value/Copyable), and [`CollectionElement`](/mojo/stdlib/builtin/value/CollectionElement). * A new magic `__origin_of(expr)` call will yield the lifetime of a memory value. We hope and expect that this will eventually be replaced by `Reference(expr).lifetime` as the parameter system evolves, but this is important in the meantime for use in function signatures. * A new magic `__type_of(expr)` call will yield the type of a value. This allows one to refer to types of other variables. For example: ```mojo fn my_function(x: Int, y: __type_of(x)) -> Int: let z: __type_of(x) = y return z ``` ### 🦋 Changed * As another step towards [removing let declarations](https://github.com/modular/modular/blob/main/mojo/proposals/remove-let-decls.md), we have removed support for let declarations inside the compiler. To ease migration, we parse `let` declarations as a `var` declaration so your code won't break. We emit a warning about this, but please switch your code to using `var` explicitly, because this migration support will be removed in a subsequent update. ```mojo fn test(): # treated as a var, but please update your code! let x = 42 # warning: 'let' is being removed, please use 'var' instead x = 9 ``` * It is no longer possible to explicitly specify implicit argument parameters in [automatically parameterized functions](/mojo/manual/parameters/#automatic-parameterization-of-functions). This ability was an oversight, and it is now an error: ```mojo fn autoparameterized(x: SIMD): pass autoparameterized[DType.int32, 1](3) # error: too many parameters ``` * `vectorize_unroll` has been removed, and [`vectorize`](/mojo/stdlib/algorithm/functional/vectorize) now has a parameter named `unroll_factor` with a default value of 1. Increasing `unroll_factor` may improve performance at the cost of binary size. See the [loop unrolling blog post](https://www.modular.com/blog/what-is-loop-unrolling-how-you-can-speed-up-mojo) for more details. * The `vectorize` signatures have changed with the closure `func` moved to the first parameter: ```mojo vectorize[func, width, unroll_factor = 1](size) vectorize[func, width, size, unroll_factor = 1]() ``` The doc string has been updated with examples demonstrating the difference between the two signatures. * The `unroll` signatures have changed with the closure `func` moved to the first parameter: ```mojo unroll[func, unroll_count]() ``` * The signatures of the [`NDBuffer`](/mojo/stdlib/buffer/buffer/NDBuffer) and [`Buffer`](/mojo/stdlib/buffer/buffer/NDBuffer) types have changed. Now, both take the type as the first parameter and no longer require the shape parameter. This allows you to use these types and have sensible defaults. For example: ```mojo NDBuffer[DType.float32, 3] ``` is equivalent to ```mojo NDBuffer[DType.float32, 3, DimList.create_unknown[3]()] ``` Users can still specify the static shape (if known) to the type: ```mojo NDBuffer[DType.float32, 3, DimList(128, 128, 3)] ``` * The error message for missing function arguments is improved: instead of describing the number of arguments (e.g. `callee expects at least 3 arguments, but 1 was specified`) the missing arguments are now described by name (e.g.
`missing 2 required positional arguments: 'b', 'c'`). * The [`CollectionElement`](/mojo/stdlib/builtin/value/CollectionElement) trait is now a built-in trait and has been removed from `collections.vector`. * The `DynamicVector(capacity: Int)` constructor has been changed to take `capacity` as a keyword-only argument to prevent implicit conversion from `Int`. * [`Variant.get[T]()`](/mojo/stdlib/utils/variant/Variant#__getitem__) now returns a `Reference` to the value rather than a copy. * The [`String`](/mojo/stdlib/collections/string/string/String) methods `tolower()` and `toupper()` have been renamed to `str.lower()` and `str.upper()`. * The `ref` and `mutref` identifiers are no longer reserved as Mojo keywords. We originally thought about using those as language sugar for references, but we believe that generic language features combined with the [`Reference`](/mojo/stdlib/memory/pointer/Pointer) type will provide a good experience without dedicated sugar. ### 🛠️ Fixed * [#435](https://github.com/modular/modular/issues/435) Structs with Self type don't always work. * [#1540](https://github.com/modular/modular/issues/1540) Crash in register\_passable self referencing struct. * [#1664](https://github.com/modular/modular/issues/1664) - Improve error message when `StaticTuple` is constructed with a negative size for the number of elements. * [#1679](https://github.com/modular/modular/issues/1679) - crash on SIMD of zero elements. * Various crashes on invalid code: [#1230](https://github.com/modular/modular/issues/1230), [#1699](https://github.com/modular/modular/issues/1699), [#1708](https://github.com/modular/modular/issues/1708) * [#1223](https://github.com/modular/modular/issues/1223) - Crash when parametric function is passed as (runtime) argument. The parser now errors out instead. * [#1530](https://github.com/modular/modular/issues/1530) - Crash during diagnostic emission for parameter deduction failure. * [#1538](https://github.com/modular/modular/issues/1538) and [#1607](https://github.com/modular/modular/issues/1607) - Crash when returning type value instead of instance of expected type. This is a common mistake and the error now includes a hint to point users to the problem. * [#1613](https://github.com/modular/modular/issues/1613) - Wrong type name in error for incorrect `self` argument type in trait method declaration. * [#1670](https://github.com/modular/modular/issues/1670) - Crash on implicit conversion in a global variable declaration. * [#1741](https://github.com/modular/modular/issues/1741) - Mojo documentation generation doesn't show `inout`/`owned` on variadic arguments. * [#1621](https://github.com/modular/modular/issues/1621) - VS Code does not highlight `raises` and `capturing` in functional type expressions. * [#1617](https://github.com/modular/modular/issues/1617) - VS Code does not highlight `fn` in specific contexts. * [#1740](https://github.com/modular/modular/issues/1740) - LSP shows unrelated info when hovering over a struct. * [#1238](https://github.com/modular/modular/issues/1238) - File shadows Mojo package path. * [#1429](https://github.com/modular/modular/issues/1429) - Crash when using nested import statement. * [#1322](https://github.com/modular/modular/issues/1322) - Crash when missing types in variadic argument. * [#1314](https://github.com/modular/modular/issues/1314) - Typecheck error when binding alias to parametric function with default argument. 
* [#1248](https://github.com/modular/modular/issues/1248) - Crash when importing from a file with the same name as another file in the search path. * [#1354](https://github.com/modular/modular/issues/1354) - Crash when importing from local package. * [#1488](https://github.com/modular/modular/issues/1488) - Crash when setting generic element field. * [#1476](https://github.com/modular/modular/issues/1476) - Crash in interpreter when calling functions in parameter context. * [#1537](https://github.com/modular/modular/issues/1537) - Crash when copying parameter value. * [#1546](https://github.com/modular/modular/issues/1546) - Modify nested vector element crashes parser. * [#1558](https://github.com/modular/modular/issues/1558) - Invalid import causes parser to crash. * [#1562](https://github.com/modular/modular/issues/1562) - Crash when calling parametric type member function. * [#1577](https://github.com/modular/modular/issues/1577) - Crash when using unresolved package as a variable. * [#1579](https://github.com/modular/modular/issues/1579) - Member access into type instances causes a crash. * [#1602](https://github.com/modular/modular/issues/1602) - Interpreter failure when constructing strings at compile time. * [#1696](https://github.com/modular/modular/issues/1696) - Fixed an issue that caused syntax highlighting to occasionally fail. * [#1549](https://github.com/modular/modular/issues/1549) - Fixed an issue when the shift amount is out of range in `SIMD.shift_left` and `SIMD.shift_right`. ## v0.7.0 (2024-01-25) ### ⭐️ New * A new Mojo-native dictionary type, [`Dict`](/mojo/stdlib/collections/dict), for storing key-value pairs. `Dict` stores values that conform to the [`CollectionElement`](/mojo/stdlib/builtin/value/CollectionElement) trait. Keys need to conform to the new [`KeyElement`](/mojo/stdlib/collections/dict/KeyElement) trait, which is not yet implemented by other standard library types. In the short term, you can create your own wrapper types to use as keys. For example, the following sample defines a `StringKey` type and uses it to create a dictionary that maps strings to `Int` values: ```mojo from collections.dict import Dict, KeyElement @value struct StringKey(KeyElement): var s: String fn __init__(inout self, owned s: String): self.s = s^ fn __init__(inout self, s: StringLiteral): self.s = String(s) fn __hash__(self) -> Int: return hash(self.s) fn __eq__(self, other: Self) -> Bool: return self.s == other.s fn main() raises: var d = Dict[StringKey, Int]() d["cats"] = 1 d["dogs"] = 2 print(len(d)) # prints 2 print(d["cats"]) # prints 1 print(d.pop("dogs")) # prints 2 print(len(d)) # prints 1 ``` We plan to add `KeyElement` conformance to standard library types in subsequent releases. * Users can opt in to assertions used in the standard library code by specifying `-D MOJO_ENABLE_ASSERTIONS` when invoking `mojo` to compile your source file(s). In the case that an assertion is fired, the assertion message will be printed along with the stack trace before the program exits. By default, assertions are *not enabled* in the standard library right now for performance reasons. * The Mojo Language Server now implements the References request. IDEs use this to provide support for **Go to References** and **Find All References**. A current limitation is that references outside of the current document are not supported, which will be addressed in the future.
* The [`sys.info`](/mojo/stdlib/sys/info) module now includes `num_physical_cores()`, `num_logical_cores()`, and `num_performance_cores()` functions. * Homogeneous variadic arguments consisting of memory-only types, such as `String`, are more powerful and easier to use. These arguments are projected into a [`VariadicListMem`](/mojo/stdlib/builtin/list_literal/VariadicListMem). (Previous releases made it easier to use variadic lists of register-passable types, like `Int`.) Subscripting into a `VariadicListMem` now returns the element instead of an obscure internal type. In addition, we now support `inout` and `owned` variadic arguments: ```mojo fn make_worldly(inout *strs: String): # This "just works" as you'd expect! for i in range(len(strs)): strs[i] += " world" fn main(): var s1: String = "hello" var s2: String = "konnichiwa" var s3: String = "bonjour" make_worldly(s1, s2, s3) print(s1) # hello world print(s2) # konnichiwa world print(s3) # bonjour world ``` (Previous releases made it easier to use variadic lists, but subscripting into a `VariadicListMem` returned a low-level pointer, which required the user to call `__get_address_as_lvalue()` to access the element.) Note that subscripting the variadic list works nicely as above, but iterating over the variadic list directly with a `for` loop produces a `Reference` (described below) instead of the desired value, so an extra subscript is required; we intend to fix this in the future. ```mojo fn make_worldly(inout *strs: String): # Requires extra [] to dereference the reference for now. for i in strs: i[] += " world" ``` Heterogeneous variadic arguments have not yet been moved to the new model, but will in future updates. Note that for variadic arguments of register-passable types like `Int`, the variadic list contains values, not references, so the dereference operator (`[]`) is not required. This code continues to work as it did previously: ```mojo fn print_ints(*nums: Int): for num in nums: print(num) print(len(nums)) ``` * Mojo now has a prototype version of a safe [`Reference`](/mojo/stdlib/memory/pointer/Pointer) type. The compiler's lifetime tracking pass can reason about references to safely extend local variable lifetime, and check indirect access safety. The `Reference` type is brand new (and currently has no syntactic sugar) so it must be explicitly dereferenced with an empty subscript: `ref[]` provides access to the underlying value. ```mojo fn main(): var a: String = "hello" var b: String = " references" var aref = Reference(a) aref[] += b print(a) # prints "hello references" aref[] += b # ^last use of b, it is destroyed here. print(aref[]) # prints "hello references references" # ^last use of a, it is destroyed here. ``` While the `Reference` type has the same in-memory representation as a C pointer or the Mojo `Pointer` type, it also tracks a symbolic "lifetime" value so the compiler can reason about the potentially accessed set of values. This lifetime is part of the static type of the reference, so it propagates through generic algorithms and abstractions built around it. The `Reference` type can form references to both mutable and immutable memory objects, e.g. those on the stack or borrowed/inout/owned function arguments. It is fully parametric over mutability, eliminating the [problems with code duplication due to mutability specifiers](https://duckki.github.io/2024/01/01/inferred-mutability.html) and provides the base for unified user-level types.
For example, it could be used to implement an array slice object that handles both mutable and immutable array slices. While this is a major step forward for the lifetimes system in Mojo, it is still *very* early and awkward to use. Notably, there is no syntactic sugar for using references, such as automatic dereferencing. Several aspects of it need to be more baked. It is getting exercised by variadic memory arguments, which is why they are starting to behave better now. Note: the safe `Reference` type and the unsafe pointer types are defined in the same module, currently named `memory.unsafe`. We expect to restructure this module in a future release. * Mojo now allows types to implement `__refattr__()` and `__refitem__()` to enable attribute and subscript syntax with computed accessors that return references. For common situations where these address a value in memory, this provides a more convenient and significantly more performant alternative to implementing the traditional get/set pairs. Note: this may be changed in the future when references auto-dereference—at that point we may switch to just returning a reference from `__getattr__()`. * Parametric closures can now capture register-passable values by copy using the `__copy_capture` decorator. For example, the following code will print `5`, not `2`. ```mojo fn foo(x: Int): var z = x @__copy_capture(z) @parameter fn formatter() -> Int: return z z = 2 print(formatter()) fn main(): foo(5) ``` * String now implements KeyElement and may be used as a key in Dict. * More robust support for structs with fields of self-referencing types. For example, the following code will work and print `0`: ```mojo struct Foo(CollectionElement): var vec: DynamicVector[Self] fn __init__(inout self: Self): self.vec = DynamicVector[Self]() fn __moveinit__(inout self: Self, owned existing: Self): self.vec = existing.vec^ fn __copyinit__(inout self: Self, existing: Self): self.vec = existing.vec fn main(): var foo = Foo() print(len(foo.vec)) ``` ### ❌ Removed * The `__takeinit__` special constructor form has been removed from the language. This "non-destructive move" operation was previously wired into the `x^` transfer operator, but had unpredictable behavior that wasn't consistent. Now that Mojo has traits, it is better to model this as an explicit `.take()` operation on a type, which would transfer out the contents of the type without ending its lifetime. For example, for a type that holds a pointer, `take()` might return a new instance pointing to the same data, and null out its own internal pointer. This change makes it clear when a lifetime is ended versus when the contents of an LValue are explicitly taken. * The current implementation of autotuning has been deprecated, as Mojo's autotuning implementation is undergoing a redesign. Tutorials around the current implementation have also been removed as they are being rewritten. Consequently, the `autotune()`, `autotune_fork()`, and `search()` functions have been removed from the standard library. * The `_OldDynamicVector` type that worked only on register passable element types has been removed. Please migrate uses to [`DynamicVector`](/mojo/stdlib/collections/list/List) which works on both register passable and memory types. * The `UnsafeFixedVector` in `utils.vector` has been removed. We recommend using either [`DynamicVector`](/mojo/stdlib/collections/list/List) or [`InlinedFixedVector`](/mojo/stdlib/collections/inline_array/InlineArray) instead.
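  For instance, a minimal migration sketch using `DynamicVector` (the element type and values here are illustrative assumptions):

  ```mojo
  from collections.vector import DynamicVector

  fn main():
      # DynamicVector handles both register-passable and memory types.
      var names = DynamicVector[String]()
      names.append("hello")
      print(names[0])
  ```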
* The `@adaptive` decorator has been removed from the language. Any uses of the decorator in a non-search context can be replaced with `@parameter if`. For example: ```mojo @adaptive fn foo[a: Bool](): constrained[a]() body1() @adaptive fn foo[a: Bool](): constrained[not a]() body2() ``` Can be rewritten as: ```mojo fn foo[a: Bool](): @parameter if a: body1() else: body2() ``` Consequently, the special `__adaptive_set` attribute has been removed as well. * Result parameters have been removed from Mojo. Result parameter declarations in function parameter lists are no longer allowed, nor are forward alias declarations. This includes removing the `param_return` statement. * The `@noncapturing` and `@closure` decorators have been removed due to refinements and improvements to the closure model. See below for more details! ### 🦋 Changed * The Mojo closure model has been refined to be more straightforward and safe. Mojo has two closure types: parameter closures and runtime closures. Parameter closures can be used in higher-order functions and are the backbone of functions like `vectorize` and `parallelize`. They are always denoted by `@parameter` and have type `fn() capturing -> T` (where `T` is the return type). On the other hand, runtime closures are always dynamic values, capture values by invoking their copy constructor, and retain ownership of their capture state. You can define a runtime closure by writing a nested function that captures values: ```mojo fn outer(b: Bool, x: String) -> fn() escaping -> None: fn closure(): print(x) # 'x' is captured by calling String.__copyinit__ fn bare_function(): print("hello") # nothing is captured if b: # closure can be safely returned because it owns its state return closure^ # function pointers can be converted to runtime closures return bare_function ``` The types of runtime closures are of the form `fn() escaping -> T`. You can pass equivalent function pointers as runtime closures. Stay tuned for capture list syntax for move capture and capture by reference, and a more unified closure model! * The `@unroll(n)` decorator can now take a parameter expression for the unroll factor, i.e. `n` can be a parameter expression that is of integer type. * The `cpython` module in the `python` package has been moved to be an internal module, i.e., `_cpython`. * `AnyType` and `Destructable` have been unified into a single trait, `AnyType`. Every nominal type (i.e. all structs) now automatically conforms to `AnyType`. * Previously, the `mojo package` command would output a Mojo package that included both partly-compiled Mojo code and fully-compiled machine code for a specific computer architecture -- the architecture of the machine being used to invoke the `mojo package` command. Now, `mojo package` only includes partly-compiled Mojo code. It is only fully compiled for the specific computer architecture being used at the point that the package is first `import`-ed. As a result, Mojo packages are smaller and more portable. * The `simd_width` and `dtype` parameters of `polynomial_evaluate` have been switched. Based on the request in [#1587](https://github.com/modular/modular/issues/1587), the `polynomial_evaluate` function has also been extended so that the `coefficients` parameter can take either a [`StaticTuple`](/mojo/stdlib/utils/static_tuple/StaticTuple) or a [`VariadicList`](/mojo/stdlib/builtin/list_literal/VariadicList).
* As a tiny step towards removing `let` declarations, this release removes the warning: `'var' was never mutated, consider switching to a 'let'`. ### 🛠️ Fixed * [#1595](https://github.com/modular/modular/issues/1595) - Improve error message when trying to materialize `IntLiteral` in runtime code. * Raising an error from the initializer of a memory-only type now works correctly in the presence of complex control flow. Previously Mojo could run the destructor on `self` before it was initialized when exiting with an error. * [#1096](https://github.com/modular/modular/issues/1096) - Improve warning messages for dead code in conditionals like `or` expressions. * [#1419](https://github.com/modular/modular/issues/1419) - Fix assertion failure with uninitialized lattice values. * [#1402](https://github.com/modular/modular/issues/1402) - Fix movable trait not detected on recursive struct implemented with `AnyPointer`. * [#1399](https://github.com/modular/modular/issues/1399) - Fix parser crash when a parameter type in a struct that implements a trait is misspelled. * [#1152](https://github.com/modular/modular/issues/1152) - Allow mutable `self` argument when overloading operators using dunder methods. * [#1493](https://github.com/modular/modular/issues/1493) - Fix crash in `DynamicVector` copy constructor in certain situations. * [#1316](https://github.com/modular/modular/issues/1316) - The `benchmark.keep` function now properly handles vector types. * [#1505](https://github.com/modular/modular/issues/1505) - The `simd.shuffle` operation now works on 64 element permutations. * [#1355](https://github.com/modular/modular/issues/1355) - Fix `String.find()` returning wrong value when starting index is non-zero. * [#1367](https://github.com/modular/modular/issues/1367) - Fix `String.replace()` returning incorrect results for multi-character search strings. * [#1535](https://github.com/modular/modular/issues/1535) - Invalid error `field 'w.x.y' destroyed out of the middle of a value, preventing the overall value from being destroyed`. * [#1475](https://github.com/modular/modular/issues/1475) - Assertion failure in nested loop. * [#1591](https://github.com/modular/modular/issues/1591) - Assertion failure when using `AnyType` struct member. * [#1503](https://github.com/modular/modular/issues/1503) - Rename the mojo build of LLDB to `mojo-lldb`, to prevent name collisions with the system's LLDB. * [#1542](https://github.com/modular/modular/issues/1542) - `@unroll` does not accept alias as unroll factor. * [#1443](https://github.com/modular/modular/issues/1443) - Compiler crash on variadic list of traits. * [#1604](https://github.com/modular/modular/issues/1604) - Variable of trivial type not destroyed by transferring ownership. * [#1341](https://github.com/modular/modular/issues/1341) - Segmentation fault when passing closures around. * [#217](https://github.com/modular/modular/issues/217) - Closure state is stack allocated. ## v0.6.1 (2023-12-18) ### ⭐️ New * The Mojo REPL now provides limited support for the `%cd` magic command. This command automatically maintains an internal stack of directories you visit during the REPL session. Usage: * `%cd 'dir'`: change to directory `dir` and push it on the directory stack. * `%cd -`: pop the directory stack and change to the last visited directory. * Structs decorated with `@value` now automatically conform to the [`Movable`](/mojo/stdlib/builtin/value/Movable) and [`Copyable`](/mojo/stdlib/builtin/value/Copyable) built-in traits. 
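  A small sketch of what the implicit conformance buys you (the `Point` struct is invented for illustration):

  ```mojo
  @value
  struct Point:
      var x: Int
      var y: Int

  fn main():
      var a = Point(1, 2)
      var b = a   # copy, via the synthesized __copyinit__
      var c = b^  # move, via the synthesized __moveinit__
      print(c.x)  # prints '1'
  ```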
* [`String`](/mojo/stdlib/collections/string/string/String) now has new [`toupper()`](/mojo/stdlib/collections/string/string/String#upper) and [`tolower()`](/mojo/stdlib/collections/string/string/String#lower) methods analogous, respectively, to Python's `str.upper()` and `str.lower()`. * Added a [`hash()`](/mojo/stdlib/hashlib/hash/hash) built-in function and [`Hashable`](/mojo/stdlib/hashlib/hash/Hashable) trait for types implementing the `__hash__()` method. Future releases will add `Hashable` support to Standard Library types. In the meantime, the `hash` module includes a version of the `hash()` function that works on arbitrary byte strings. To generate hashes for [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) types, you use the internal `_hash_simd()` function: ```mojo from builtin.hash import _hash_simd fn gen_simd_hash(): let vector = SIMD[DType.int64, 4](1, 2, 3, 4) let hash = _hash_simd(vector) ``` * Several standard library types now conform to the [`CollectionElement`](/mojo/stdlib/builtin/value/CollectionElement) trait. These types include [`Bool`](/mojo/stdlib/builtin/bool/Bool), [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral), [`DynamicVector`](/mojo/stdlib/collections/list/List), [`Tensor`](/max/api/mojo/tensor/tensor/Tensor), [`TensorShape`](/max/api/mojo/tensor/tensor_shape/TensorShape), and [`TensorSpec`](/max/api/mojo/tensor/tensor_spec/TensorSpec). ### 🦋 Changed * `utils.vector` has been moved to a new `collections` package to make space for new collections. This means that if you had previous code that did `from utils.vector import DynamicVector`, it now needs to be `from collections.vector import DynamicVector` due to the move. * The special destructor method `__del__()` has been changed to enforce that it cannot raise an error. Raising destructors are not supported properly at the moment. ### 🛠️ Fixed * [#1421](https://github.com/modular/modular/issues/1421) - Fixed a crash when using Tuples in the REPL. * [#222](https://github.com/modular/modular/issues/222) - Generate an error for obviously self-recursive functions. * [#1408](https://github.com/modular/modular/issues/1408) - Fix overload resolution when candidates can return generic types. * [#1413](https://github.com/modular/modular/issues/1413) and [#1395](https://github.com/modular/modular/issues/1395) - Do not crash when re-declaring a builtin declaration. * [#1307](https://github.com/modular/modular/issues/1307) - Fix compatibility of function signatures that only differ in default argument values. * [#1380](https://github.com/modular/modular/issues/1380) - Fix printing of empty `String`. ## v0.6.0 (2023-12-04) ### 🔥 Legendary * Traits have arrived! You can now define a *trait*, which consists of a required set of method prototypes. A struct can *conform to* the trait by implementing these methods. This lets you write generic functions that work on any structs that conform to a given trait. The following section gives a brief overview of traits—see the [Mojo Manual](/mojo/manual/traits) and this [traits blog post](https://modul.ar/traits-blog) for more details! Traits are declared with the `trait` keyword. The bodies of traits should contain method signatures declared with `...` as their bodies. Default method implementations are not supported yet. ```mojo trait SomeTrait: fn required_method(self, x: Int): ... ``` The trait can be implemented on a struct by inheriting from it.
```mojo struct SomeStruct(SomeTrait): fn required_method(self, x: Int): print("hello traits", x) ``` You can then write a generic function that accepts any type that conforms to the trait. You do this by creating a parameterized function with a trait-typed parameter: ```mojo fn fun_with_traits[T: SomeTrait](x: T): x.required_method(42) ``` Which can be invoked with instances of types that conform to the trait: ```mojo var thing = SomeStruct() # Infer the parameter `T`! fun_with_traits(thing) ``` Traits can also inherit from other traits, which simply requires that implementers of the child trait also conform to all parent traits. ```mojo trait Parent: fn parent_func(self): ... trait Child(Parent): fn child_func(self): ... ``` Then, both child and parent trait methods can be invoked on instances of the trait `Child`. As well, an instance of the child trait can be converted to an instance of the parent trait. ```mojo fn the_parents[T: Parent](x: T): x.parent_func() fn the_children[T: Child](x: T): x.child_func() x.parent_func() # Upcast `x` from instance of `Child` to `Parent`. the_parents(x) ``` For more information, see the [Traits page](/mojo/manual/traits) in the Mojo Manual. * A fundamental `Destructable` trait has been added to the language. This is a core trait that every trait automatically conforms to. This enables destruction of generic types and generic collections. **Note:** We're aware that this trait might be better spelled `Destructible`. We're planning on removing it in the future and moving its functionality to `AnyType` so that any type that doesn't provide its own destructor will have a default, no-op destructor. * We've added some traits to the standard library; you can implement these on your own types: * [`Destructable`](/mojo/stdlib/builtin/anytype/AnyType) * [`Copyable`](/mojo/stdlib/builtin/value/Copyable) * [`Movable`](/mojo/stdlib/builtin/value/Movable) * [`Stringable`](/mojo/stdlib/builtin/str/Stringable) * [`Intable`](/mojo/stdlib/builtin/int/Intable) * [`Sized`](/mojo/stdlib/builtin/len/Sized) * [`CollectionElement`](/mojo/stdlib/builtin/value/CollectionElement) * We added built-in [`len()`](/mojo/stdlib/builtin/len/len), [`str()`](/mojo/stdlib/builtin/str/str), and [`int()`](/mojo/stdlib/builtin/int/int-function) functions, which work with types that implement the `Sized`, `Stringable`, and `Intable` traits, respectively. * [`DynamicVector`](/mojo/stdlib/collections/list/List) is now a proper generic collection that can use any type that implements the `Movable` and `Copyable` traits. This means you can now write, for example, `DynamicVector[String]`. Also, `DynamicVector` now invokes its element destructors upon destruction, so `_del_old` has been deleted. * `print` now works on any types that implement `Stringable` by invoking their `__str__` method: ```mojo @value struct BoxedInt(Stringable): var value: Int fn __str__(self) -> String: return str(self.value) print(BoxedInt(11), "hello traits!", BoxedInt(42)) ``` ### ⭐️ New * The [Mojo Manual](/mojo/manual/) is an all-new, complete Mojo user guide. It doesn't include *everything* about Mojo yet, but it includes a lot, and more than the original programming manual (now deprecated). Plus, the entire Mojo Manual and other Mojo docs are now [open-sourced on GitHub](https://github.com/modular/modular/tree/main/mojo/docs), and we'd love to accept contributions to help us improve them!
* Mojo now supports partial automatic parameterization: when a function is declared with an argument of a partially bound type, the unbound parameters of that type are implicitly added to the function's input parameters. For example: ```mojo @value struct Fudge[a: Int, b: Int, c: Int = 7]: ... # These function declarations are roughly equivalent: fn eat(f: Fudge[5]): ... # implicitly parameterized fn eat[_b: Int](f: Fudge[5, _b]): ... # explicitly parameterized ``` In the first signature for `eat()`, the `b` parameter isn't bound, so it's *implicitly* added as an input parameter on the function. In the second signature for `eat()`, the author has explicitly defined an input parameter (`_b`), which is bound to the second parameter on the argument type (which happens to be `b`). Both functions can be called like this: ```mojo eat(Fudge[5, 8]()) ``` Mojo infers the value of the `b` parameter from the argument (in this case, 8\). With the second signature, you can also pass the `_b` parameter value explicitly: ```mojo eat[3](Fudge[5, 3]()) ``` Moreover, Mojo now allows you to explicitly mark parameters as unbound using the `_` as syntax meaning "placeholder for an unbound parameter." For example: ```mojo # These function declarations are roughly equivalent: fn eat(f: Fudge[5, _, c=_]): ... # implicitly parameterized fn eat(f: Fudge[c=_, a=5, b=_]): ... # implicitly parameterized fn eat[_b: Int, _c: Int](f: Fudge[5, _b, _c]): ... # explicitly parameterized ``` The first two signatures explicitly unbind the `b` and `c` parameters. In the last signature, the `_b` and `_c` parameters are explicitly declared by the author, and bound to the `b` and `c` parameters in the argument type. Any of these signatures can be called like this: ```mojo eat(Fudge[5, 8]()) eat(Fudge[5, 8, 9]()) ``` Note that the default parameter values of struct parameters are bound, unless explicitly unbound by the user. For more information, see the [Mojo Manual](/mojo/manual/parameters/#fully-bound-partially-bound-and-unbound-types). * Parametric types can now be partially bound in certain contexts. For example, a new `Scalar` type alias has been added defined as: ```mojo alias Scalar = SIMD[size=1] ``` Which creates a parametric type alias `Scalar` with a single parameter of type `DType`. Types can also be partially or fully bound in other contexts. For instance, `alias` declarations of type values inside functions now work properly: ```mojo fn type_aliases(): alias T = SIMD print(T[DType.float32, 1]()) alias Partial = T[type=DType.int32] print(Partial[2]()) ``` * The `__mlir_op` feature now supports operations that return multiple results. To use them, you write the `_type` field as a `Tuple` of types. For example: ```mojo # The `ret` variable has type `Tuple[Int, Int]`. let ret = __mlir_op.`multi_result_op`[_type=(Int, Int)]() ``` * Mojo now has the ability to read raw bytes from a file using the [`read_bytes()`](/mojo/stdlib/builtin/file/FileHandle#read_bytes) method. For example: ```mojo with open("file.binary", "r") as f: data = f.read_bytes() ``` * A size argument was added to the [`read()`](/mojo/stdlib/builtin/file/FileHandle#read) and [`read_bytes()`](/mojo/stdlib/builtin/file/FileHandle#read_bytes) methods on the builtin `file.FileHandle`. The size argument defaults to -1 and maintains the previous "read to EOF" behavior when size is negative. 
```mojo with open("file.binary", "r") as f: data1 = f.read_bytes(1024) data2 = f.read_bytes(256) ``` * [`Path`](/mojo/stdlib/pathlib/path/Path) now has `read_bytes()` and `read_text()` methods to read file contents from a path: ```mojo let text_path = Path("file.txt") let text = text_path.read_text() let binary_path = Path("file.binary") let data = binary_path.read_bytes() ``` * `Tensor` has new `save()` and `load()` methods to save and load to file. These methods preserve shape and datatype information. For example: ```mojo let tensor = Tensor[DType.float32]() tensor.save(path) let tensor_from_file = Tensor[DType.float32].load(path) ``` * Subscripting added to [`DTypePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) and [`Pointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer): ```mojo let p = DTypePointer[DType.float16].alloc(4) for i in range(4): p[i] = i print(p[i]) ``` * `file.FileHandle` now has a `seek()` method. * [`String`](/mojo/stdlib/collections/string/string/String) now has an [`rfind()`](/mojo/stdlib/collections/string/string/String#rfind) method analogous to Python's `str.rfind()`. * `String` now has an [`split()`](/mojo/stdlib/collections/string/string/String#split) method analogous to Python's `str.split()`. * [`Path`](/mojo/stdlib/pathlib/path/Path) now has a [`suffix()`](/mojo/stdlib/pathlib/path/Path#suffix) method analogous to Python's `pathlib.Path.suffix`. * The Mojo REPL now supports indented expressions, making it a bit easier to execute expressions copied from an indented block (such as a doc string). * The Mojo Language Server now implements the Document Symbols request. IDEs use this to provide support for **Outline View** and **Go to Symbol**. This addresses [Issue #960](https://github.com/modular/modular/issues/960). * The Mojo Language Server now shows documentation when code completing modules or packages in `import` statements. * The Mojo Language Server now supports processing code examples, defined as markdown Mojo code blocks, inside of doc strings. This enables IDE features while writing examples in API documentation. * The Mojo Language Server now provides semantic token information, providing better highlighting for symbols whose semantics are not statically analyzable. * The Mojo Language Server now classifies doc strings as folding ranges, making them easier to collapse, reducing vertical space while editing. * Command line options for the `mojo` driver that take arguments can now be written in either of two ways: both `--foo FOO` and `--foo=FOO`. Previously, only the former was valid. ### 🦋 Changed * Variadic list types [`VariadicList`](/mojo/stdlib/builtin/list_literal/VariadicList) and [`VariadicListMem`](/mojo/stdlib/builtin/list_literal/VariadicListMem) are now iterable. Variadic arguments are automatically projected into one of these types inside the function body, so var args can be iterated: ```mojo fn print_ints(*nums: Int): for num in nums: print(num) print(len(nums)) ``` * The assert functions in the [`testing`](/mojo/stdlib/testing/testing) package now raise an `Error` when the assertion fails instead of returning a `Bool` for whether the assertion succeeded or not. * Parameters of [`AnyType`](/mojo/stdlib/builtin/type_aliases) type are no longer (implicitly) assumed to be register-passable. A new `AnyRegType` type is used to represent generic types that are register passable. 
* Changing the units in a [`benchmark`](/mojo/stdlib/benchmark/benchmark) report is now an argument instead of a parameter: ```mojo let report = benchmark.run[timer]() report.print(Unit.ms) ``` * Default values on `inout` arguments are no longer permitted, i.e. the following will now raise an error: ```mojo fn inout_default(inout x: Int = 2): ... ``` * The `to_string()` function has been removed from [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) in favor of the new `__str__()` function. This composes better with traits so it can be used with the generic `str()` function. ### 🛠️ Fixed * [#734](https://github.com/modular/modular/issues/734) - Consumption of struct works only for types with a `__del__` method. * [#910](https://github.com/modular/modular/issues/910) - Parser crash when using memory-only generic type as return of function that `raise`s. * [#1060](https://github.com/modular/modular/issues/1060) - Mojo happily parses code that has messed up indentation * [#1159](https://github.com/modular/modular/issues/1159) - The language server doesn't warn about bad return type. * [#1166](https://github.com/modular/modular/issues/1166) - warning: unreachable code after return statement with context manager * [#1098](https://github.com/modular/modular/issues/1098) - The language server doesn't highlight properties of PythonObjects correctly. * [#1153](https://github.com/modular/modular/issues/1153) - The language server crashes when parsing an invalid multi-nested module import. * [#1236](https://github.com/modular/modular/issues/1236) - The language server doesn't show autocomplete in if statements. * [#1246](https://github.com/modular/modular/issues/1246) - Warning diagnostics are transient in the presence of caching. ### Known Issue * There is an issue affecting Jupyter notebooks that use autotuning and traits. This issue only manifests on macOS, and the same code runs without issue outside of the notebooks. This issue affects the *Matrix multiplication in Mojo* notebook. ## v0.5.0 (2023-11-2) ### ⭐️ New * The [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type now defaults to the architectural SIMD width of the type. This means you can write `SIMD[DType.float32]` which is equivalent to `SIMD[DType.float32, simdwidthof[DType.float32]()]`. * The [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type now contains a `join()` function that allows you to concatenate two `SIMD` values together and produce a new `SIMD` value. * Mojo now supports compile-time *keyword parameters*, in addition to existing support for [keyword arguments](/mojo/manual/parameters/#optional-parameters-and-keyword-parameters). For example: ```mojo fn foo[a: Int, b: Int = 42](): print(a, "+", b) foo[a=5]() # prints '5 + 42' foo[a=7, b=13]() # prints '7 + 13' foo[b=20, a=6]() # prints '6 + 20' ``` Keyword parameters are also supported in structs: ```mojo struct KwParamStruct[a: Int, msg: String = "🔥mojo🔥"]: fn __init__(inout self): print(msg, a) fn use_kw_params(): KwParamStruct[a=42]() # prints '🔥mojo🔥 42' KwParamStruct[5, msg="hello"]() # prints 'hello 5' KwParamStruct[msg="hello", a=42]() # prints 'hello 42' ``` For more detail, see the [Mojo Manual](/mojo/manual/parameters/#optional-parameters-and-keyword-parameters). 
  For the time being, the following notable limitations apply:

  * Keyword-only parameters are **not supported** yet:

    ```mojo
    fn baz[*args: Int, b: Int](): pass  # fails
    fn baz[a: Int, *, b: Int](): pass   # fails
    ```

    (The analogous keyword-only arguments in Python are described in [PEP 3102](https://peps.python.org/pep-3102/).)

  * Variadic keyword parameters are **not supported** yet:

    ```mojo
    fn baz[a: Int, **kwargs: Int](): pass  # fails
    ```

* Mojo now supports "automatic" parameterization of functions. What this means is that if a function argument type is parametric but has no bound parameters, its unbound parameters are automatically added as input parameters on the function. This works with existing features to allow you to write parametric functions with less boilerplate.

  ```mojo
  @value
  struct Thing[x: Int, y: Int]:
      pass

  fn foo(v: Thing):
      print(v.x)
      print(v.y)

  fn main():
      let v = Thing[2, 3]()
      foo(v)
  ```

  However, partial autoparameterization is **not supported** yet:

  ```mojo
  fn foo(v: Thing[y=7]):  # Partially bound type not allowed yet.
      ...
  ```

* Keyword argument passing is supported when invoking `__getitem__` using the bracket syntax:

  ```mojo
  @value
  struct MyStruct:
      fn __getitem__(self, x: Int, y: Int, z: Int) -> Int:
          return x * y + z

  MyStruct()[z=7, x=3, y=5]  # returns 22
  ```

  However, keyword argument passing to `__setitem__` using the bracket syntax is **not supported** yet:

  ```mojo
  @value
  struct OtherStruct:
      fn __setitem__(self, x: Int, y: Int): pass

  OtherStruct()[x=1] = 4  # fails
  ```

* Function argument input parameters can now be referenced within the signature of the function:

  ```mojo
  fn foo(x: SIMD, y: SIMD[x.type, x.size]): pass
  ```

* The [`benchmark`](/mojo/stdlib/benchmark/benchmark) module has been simplified and improved so you can now run:

  ```mojo
  import benchmark
  from time import sleep

  fn sleeper():
      sleep(.01)

  fn main():
      let report = benchmark.run[sleeper]()
      print(report.mean())
  ```

  It no longer requires a capturing `fn`, so it can benchmark functions outside the same scope. You can print a report with:

  ```mojo
  report.print()
  ```

  ```plaintext
  ---------------------
  Benchmark Report (s)
  ---------------------
  Mean: 0.012314264957264957
  Total: 1.440769
  Iters: 117
  Warmup Mean: 0.0119335
  Warmup Total: 0.023866999999999999
  Warmup Iters: 2
  Fastest Mean: 0.012227958333333334
  Slowest Mean: 0.012442699999999999
  ```

  Units for all functions default to seconds, but can be changed with:

  ```mojo
  from benchmark import Unit
  report.print[Unit.ms]()
  ```

* Mojo now supports struct parameter deduction (a.k.a. class template argument deduction, or CTAD) for partially bound types. Struct parameter deduction is also possible from static methods. For example:

  ```mojo
  @value
  struct Thing[v: Int]: pass

  struct CtadStructWithDefault[a: Int, b: Int, c: Int = 8]:
      fn __init__(inout self, x: Thing[a]):
          print("hello", a, b, c)

      @staticmethod
      fn foo(x: Thing[a]):
          print("🔥", a, b, c)

  fn main():
      _ = CtadStructWithDefault[b=7](Thing[6]())  # prints 'hello 6 7 8'
      CtadStructWithDefault[b=7].foo(Thing[6]())  # prints '🔥 6 7 8'
  ```

* `Tensor` has new `fromfile()` and `tofile()` methods to save and load as bytes from a file.

* The built-in `print()` function now works on the [`Tensor`](/max/api/mojo/tensor/tensor/Tensor) type.

* [`TensorShape`](/max/api/mojo/tensor/tensor_shape/TensorShape) and [`TensorSpec`](/max/api/mojo/tensor/tensor_spec/TensorSpec) now have constructors that take [`DynamicVector[Int]`](/mojo/stdlib/collections/list/List) and [`IndexList`](/mojo/stdlib/utils/index_/IndexList) to initialize shapes.
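  For instance, a shape can now be built from a dynamically constructed vector. A sketch of what this enables (the `TensorSpec(dtype, shape)` constructor form and the import paths shown are assumptions for this release):

  ```mojo
  from tensor import TensorShape, TensorSpec
  from utils.vector import DynamicVector  # import path assumed for this release

  fn make_spec() -> TensorSpec:
      var dims = DynamicVector[Int]()
      dims.push_back(2)
      dims.push_back(3)
      # Initialize a shape directly from a DynamicVector[Int].
      let shape = TensorShape(dims)
      return TensorSpec(DType.float32, shape)
  ```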
* The [`String`](/mojo/stdlib/collections/string/string/String) type now has the `count()` and `find()` methods to enable counting the number of occurrences or finding the offset index of a substring in a string.

* The `String` type now has a `replace()` method which allows you to replace a substring with another string.

### 🦋 Changed

* [`VariadicList`](/mojo/stdlib/builtin/list_literal/VariadicList) and [`VariadicListMem`](/mojo/stdlib/builtin/list_literal/VariadicListMem) moved under builtins, and no longer need to be imported.

* Variadic arguments are now automatically projected into a `VariadicList` or `VariadicListMem` inside the function body. This allows for more flexibility in using var args. For example:

  ```mojo
  fn print_ints(*nums: Int):
      let len = len(nums)
      for i in range(len):
          print(nums[i])
      print(len)
  ```

* The parameters for [`InlinedFixedVector`](/mojo/stdlib/collections/inline_array/InlineArray) have been switched. The parameters are now `[type, size]` instead of `[size, type]`. The `InlinedFixedVector` now has a default size, which means that one can just use `InlinedFixedVector` as `InlinedFixedVector[Float32]`, and the default size is used.

* The `write_file()` method in [`Buffer`](/mojo/stdlib/buffer/buffer/NDBuffer) and [`NDBuffer`](/mojo/stdlib/buffer/buffer/NDBuffer) has been renamed to `tofile()` to match the Python naming.

* Mojo will now utilize all available cores across all NUMA sockets on the host machine by default. The prior default behavior was to use all the cores on the first socket.

### ❌ Removed

* The `math.numerics` module is now private, because its types (`FPUtils` and `FlushDenormals`) should not be used externally.

### 🛠️ Fixed

* [#532](https://github.com/modular/modular/issues/532) - Compiler optimizing while True loop away
* [#760](https://github.com/modular/modular/issues/760) - Compilation error: 'hlcf.for.yield' op specifies 0 branch inputs but target expected 1 along control-flow edge from here
* [#849](https://github.com/modular/modular/issues/849) - The `Tensor` type is now initialized with zeros at construction time.
* [#912](https://github.com/modular/modular/issues/912) - Invalid load for `__get_address_as_lvalue`.
* [#916](https://github.com/modular/modular/issues/916) - Parser crash when specifying default values for `inout` arguments.
* [#943](https://github.com/modular/modular/issues/943) - Mojo hangs if you use `continue` in the nested loop
* [#957](https://github.com/modular/modular/issues/957) - Parser crash when a function call with variadic arguments of a memory-only type is evaluated at compile time.
* [#990](https://github.com/modular/modular/issues/990) - Fixes rounding issue with floor division with negative numerator.
* [#1018](https://github.com/modular/modular/issues/1018) - In some cases the sort function was returning invalid results. This release fixes some of these corner cases.
* [#1010](https://github.com/modular/modular/issues/1010) - Initializing tensor in alias declaration results in crash.
* [#1110](https://github.com/modular/modular/issues/1110) - The `time.now()` function now returns nanoseconds across all operating systems.
* [#1115](https://github.com/modular/modular/issues/1115) - cannot load non-register passable type into SSA register.

## v0.4.0 for Mac (2023-10-19)

### 🔥 Legendary

* Mojo for Mac! The Mojo SDK now works on macOS (Apple silicon). This is the same version previously released for Linux.
Get the latest version of the SDK for your Mac system: [Download Now!](https://developer.modular.com/download) ## v0.4.0 (2023-10-05) ### ⭐️ New * Mojo now supports default parameter values. For example: ```mojo fn foo[a: Int = 3, msg: StringLiteral = "woof"](): print(msg, a) fn main(): foo() # prints 'woof 3' foo[5]() # prints 'woof 5' foo[7, "meow"]() # prints 'meow 7' ``` Inferred parameter values take precedence over defaults: ```mojo @value struct Bar[v: Int]: pass fn foo[a: Int = 42, msg: StringLiteral = "quack"](bar: Bar[a]): print(msg, a) fn main(): foo(Bar[9]()) # prints 'quack 9' ``` Structs also support default parameters: ```mojo @value struct DefaultParams[msg: StringLiteral = "woof"]: alias message = msg fn main(): print(DefaultParams[]().message) # prints 'woof' print(DefaultParams["meow"]().message) # prints 'meow' ``` * The new [`file`](/mojo/stdlib/builtin/file) module adds basic file I/O support. You can now write: ```mojo var f = open("my_file.txt", "r") print(f.read()) f.close() ``` or ```mojo with open("my_file.txt", "r") as f: print(f.read()) ``` * Mojo now allows context managers to support an `__enter__` method without implementing support for an `__exit__` method, enabling idioms like this: ```mojo # This context manager consumes itself and returns it as the value. fn __enter__(owned self) -> Self: return self^ ``` Here Mojo *cannot* invoke a noop `__exit__` method because the context manager is consumed by the `__enter__` method. This can be used for types (like file descriptors) that are traditionally used with `with` statements, even though Mojo's guaranteed early destruction doesn't require that. * A very basic version of `pathlib` has been implemented in Mojo. The module will be improved to achieve functional parity with Python in the next few releases. * The `memory.unsafe` module now contains a `bitcast` function. This is a low-level operation that enables bitcasting between pointers and scalars. * The input parameters of a parametric type can now be directly accessed as attribute references on the type or variables of the type. For example: ```mojo @value struct Thing[param: Int]: pass fn main(): print(Thing[2].param) # prints '2' let x = Thing[9]() print(x.param) # prints '9' ``` Input parameters on values can even be accessed in parameter contexts. For example: ```mojo fn foo[value: Int](): print(value) let y = Thing[12]() alias constant = y.param + 4 foo[constant]() # prints '16' ``` * The Mojo REPL now supports code completion. Press Tab while typing to query potential completion results. * Error messages from Python are now exposed in Mojo. For example the following should print `No module named 'my_uninstalled_module'`: ```mojo fn main(): try: let my_module = Python.import_module("my_uninstalled_module") except e: print(e) ``` * Error messages can now store dynamic messages. For example, the following should print "Failed on: Hello" ```mojo fn foo(x: String) raises: raise Error("Failed on: " + x) fn main(): try: foo("Hello") except e: print(e) ``` ### 🦋 Changed * We have improved and simplified the `parallelize` function. The function now elides some overhead by caching the Mojo parallel runtime. * The Mojo REPL and Jupyter environments no longer implicitly expose `Python`, `PythonObject`, or `Pointer`. 
These symbols must now be imported explicitly, for example:

```mojo
from python import Python
from python.object import PythonObject
from memory.unsafe import Pointer
```

* The syntax for specifying attributes with the `__mlir_op` prefix has changed to mimic Python's keyword argument passing syntax. That is, `=` should be used instead of `:`, e.g.:

  ```mojo
  # Old syntax, now fails.
  __mlir_op.`index.bool.constant`[value : __mlir_attr.false]()
  # New syntax.
  __mlir_op.`index.bool.constant`[value=__mlir_attr.false]()
  ```

* You can now print the `Error` object directly. The `message()` method has been removed.

### 🛠️ Fixed

* [#794](https://github.com/modular/modular/issues/794) - Parser crash when using the `in` operator.
* [#936](https://github.com/modular/modular/issues/936) - The `Int` constructor now accepts other `Int` instances.
* [#921](https://github.com/modular/modular/issues/921) - Better error message when running `mojo` on a module with no `main` function.
* [#556](https://github.com/modular/modular/issues/556) - UInt64s are now printed correctly.
* [#804](https://github.com/modular/modular/issues/804) - Emit error instead of crashing when passing variadic arguments of unsupported types.
* [#833](https://github.com/modular/modular/issues/833) - Parser crash when assigning module value.
* [#752](https://github.com/modular/modular/issues/752) - Parser crash when calling async def.
* [#711](https://github.com/modular/modular/issues/711) - The overload resolution logic now correctly prioritizes instance methods over static methods (if candidates are an equally good match otherwise), and no longer crashes if a static method has a `Self` type as its first argument.
* [#859](https://github.com/modular/modular/issues/859) - Fix confusing error and documentation of the `rebind` builtin.
* [#753](https://github.com/modular/modular/issues/753) - Direct use of LLVM dialect produces strange errors in the compiler.
* [#926](https://github.com/modular/modular/issues/926) - Fixes an issue that occurred when a function with a return type of `StringRef` raised an error. When the function raised an error, it incorrectly returned the string value of that error.
* [#536](https://github.com/modular/modular/issues/536) - Report more information on Python exceptions.

## v0.3.1 (2023-09-28)

Our first-ever patch release of the Mojo SDK is here! Release v0.3.1 includes primarily installation-related fixes. If you’ve had trouble installing the previous versions of the SDK, this release may be for you.

### 🛠️ Fixed

* [#538](https://github.com/modular/modular/issues/538) - Installation hangs during the testing phase. This issue occurs on machines with a low number of CPU cores, such as free AWS EC2 instances and GitHub Codespaces.
* [#590](https://github.com/modular/modular/issues/590) - Installation fails with a “failed to run python” message.
* [#672](https://github.com/modular/modular/issues/672) - Language server hangs on code completion. Related to #538, this occurs on machines with a low number of CPU cores.
* [#913](https://github.com/modular/modular/issues/913) - In the REPL and Jupyter notebooks, inline comments were being parsed incorrectly.

## v0.3.0 (2023-09-21)

There's more Mojo to love in this, the second release of the Mojo SDK! This release includes new features, an API change, and bug fixes. There's also an updated version of the [Mojo extension for VS Code](https://marketplace.visualstudio.com/items?itemName=modular-mojotools.vscode-mojo).
### ⭐️ New

* Mojo now has partial support for passing keyword arguments to functions and methods. For example, the following should work:

  ```mojo
  fn foo(a: Int, b: Int = 3) -> Int:
      return a * b

  fn main():
      print(foo(6, b=7))    # prints '42'
      print(foo(a=6, b=7))  # prints '42'
      print(foo(b=7, a=6))  # prints '42'
  ```

  Parameters can also be inferred from keyword arguments, for example:

  ```mojo
  fn bar[A: AnyType, B: AnyType](a: A, b: B):
      print("Hello 🔥")

  fn bar[B: AnyType](a: StringLiteral, b: B):
      print(a)

  fn main():
      bar(1, 2)           # prints `Hello 🔥`
      bar(b=2, a="Yay!")  # prints `Yay!`
  ```

  For the time being, the following notable limitations apply:

  * Keyword-only arguments are not supported:

    ```mojo
    fn baz(*args: Int, b: Int): pass  # fails
    fn baz(a: Int, *, b: Int): pass   # fails
    ```

    (Keyword-only arguments are described in [PEP 3102](https://peps.python.org/pep-3102/).)

  * Variadic keyword arguments are not supported:

    ```mojo
    fn baz(a: Int, **kwargs: Int): pass  # fails
    ```

* Mojo now supports the `@nonmaterializable` decorator. The purpose is to mark data types that should only exist in the parameter domain. To use it, a struct is decorated with `@nonmaterializable(TargetType)`. Any time the nonmaterializable type is converted from the parameter domain, it is automatically converted to `TargetType`. A nonmaterializable struct should have all of its methods annotated as `@always_inline`, and must be computable in the parameter domain. In the following example, the `NmStruct` type can be added in the parameter domain, but is converted to `HasBool` when materialized.

  ```mojo
  @value
  @register_passable("trivial")
  struct HasBool:
      var x: Bool

      fn __init__(x: Bool) -> Self:
          return Self {x: x}

      @always_inline("nodebug")
      fn __init__(nms: NmStruct) -> Self:
          return Self {x: True if (nms.x == 77) else False}

  @value
  @nonmaterializable(HasBool)
  @register_passable("trivial")
  struct NmStruct:
      var x: Int

      @always_inline("nodebug")
      fn __add__(self: Self, rhs: Self) -> Self:
          return NmStruct(self.x + rhs.x)

  alias stillNmStruct = NmStruct(1) + NmStruct(2)
  # When materializing to a run-time variable, it is automatically converted,
  # even without a type annotation.
  let convertedToHasBool = stillNmStruct
  ```

* Mojo integer literals now produce the `IntLiteral` infinite precision integer type when used in the parameter domain. `IntLiteral` is materialized to the `Int` type for runtime computation, but intermediate computations at compile time, using supported operators, can now exceed the bit width of the `Int` type.

* The Mojo Language Server now supports top-level code completions, enabling completion when typing a reference to a variable, type, etc. This resolves [#679](https://github.com/modular/modular/issues/679).

* The Mojo REPL now colorizes the resultant variables to help distinguish input expressions from the output variables.

### 🦋 Changed

* Mojo allows types to implement two forms of move constructors, one that is invoked when the lifetime of one value ends, and one that is invoked if the compiler cannot prove that. These were previously both named `__moveinit__`, with the following two signatures:

  ```mojo
  fn __moveinit__(inout self, owned existing: Self): ...
  fn __moveinit__(inout self, inout existing: Self): ...
  ```

  We've changed the second form to get its own name to make it more clear that these are two separate operations: the second has been renamed to `__takeinit__`:

  ```mojo
  fn __moveinit__(inout self, owned existing: Self): ...
  fn __takeinit__(inout self, inout existing: Self): ...
  ```

  The name is intended to connote that the operation takes the conceptual value from the source (without destroying it), unlike the first one, which "moves" a value from one location to another.

  For more information, see the Mojo Manual section on [move constructors](/mojo/manual/lifecycle/life#move-constructor).

* The `Error` type in Mojo has changed. Instead of extracting the error message using `error.value`, you will now extract the error message using `error.message()`.

### 🛠️ Fixed

* [#503](https://github.com/modular/modular/issues/503) - Improve error message for failure lowering `kgen.param.constant`.
* [#554](https://github.com/modular/modular/issues/554) - Alias of static tuple fails to expand.
* [#500](https://github.com/modular/modular/issues/500) - Call expansion failed due to verifier error.
* [#422](https://github.com/modular/modular/issues/422) - Incorrect comment detection in multiline strings.
* [#740](https://github.com/modular/modular/issues/740) - Improve messaging on how to exit the REPL.
* [#756](https://github.com/modular/modular/issues/756) - Fix initialization errors of the VS Code extension.
* [#575](https://github.com/modular/modular/issues/575) - Build LLDB/REPL with libedit for a nicer editing experience in the terminal.

## v0.2.1 (2023-09-07)

The first versioned release of Mojo! 🔥

All earlier releases were considered version 0.1.

### 🔥 Legendary

* First release of the Mojo SDK!

  You can now develop with Mojo locally. The Mojo SDK is currently available for Ubuntu Linux systems, and support for Windows and macOS is coming soon. You can still develop from a Windows or Mac computer using a container or remote Linux system.

  The Mojo SDK includes the Mojo standard library and the [Mojo command-line interface](/mojo/cli/) (CLI), which allows you to run, compile, and package Mojo code. It also provides a REPL programming environment.

  [Get the Mojo SDK!](https://developer.modular.com/download)

* First release of the [Mojo extension for VS Code](https://marketplace.visualstudio.com/items?itemName=modular-mojotools.vscode-mojo). This provides essential Mojo language features in Visual Studio Code, such as code completion, code quick fixes, docs tooltips, and more. Even when developing on a remote system, using VS Code with this extension provides a native-like IDE experience.

### ⭐️ New

* A new `clobber_memory` function has been added to the [`benchmark`](/mojo/stdlib/benchmark/benchmark) module. The clobber memory function tells the system to flush all memory operations at the specified program point. This allows you to benchmark operations without the compiler reordering memory operations.

* A new `keep` function has been added to the [`benchmark`](/mojo/stdlib/benchmark/benchmark) module. The `keep` function tries to tell the compiler not to optimize the variable away if not used. This allows you to avoid the compiler's dead code elimination mechanism, with a low-footprint side effect.

* New `shift_right` and `shift_left` functions have been added to the [`simd`](/mojo/stdlib/builtin/simd) module. They shift the elements in a SIMD vector right/left, filling elements with zeros as needed.

* A new `cumsum` function has been added to the [`reduction`](/mojo/stdlib/algorithm/reduction) module that computes the cumulative sum (also known as scan) of input elements.

* The Mojo Jupyter kernel now supports code completion.

### 🦋 Changed

* Extends `rotate_bits_left`, `rotate_left`, `rotate_bits_right`, and `rotate_right` to operate on `Int` values.
The ordering of parameters has also been changed to enable type inference. Now it's possible to write `rotate_right[shift_val](simd_val)` and have the `dtype` and `simd_width` inferred from the argument. This addresses [Issue #528](https://github.com/modular/modular/issues/528). ### 🛠️ Fixed * Fixed a bug causing the parser to crash when the `with` statement was written without a colon. This addresses [Issue #529](https://github.com/modular/modular/issues/529). * Incorrect imports no longer crash when there are other errors at the top level of a module. This fixes [Issue \#531](https://github.com/modular/modular/issues/531). ## August 2023 ### 2023-08-24 * Fixed issue where the `with expr as x` statement within `fn` behaved as if it were in a `def`, binding `x` with function scope instead of using lexical scope. #### ⭐️ New * Major refactoring of the standard library to enable packaging and better import ergonomics: * The packages are built as binaries to improve startup speed. * Package and module names are now lowercase to align with the Python style. * Modules have been moved to better reflect the purpose of the underlying functions (e.g. `Pointer` is now within the `unsafe` module in the `memory` package). * The following modules are now included as built-ins: `SIMD`, `DType`, `IO`, `Object`, and `String`. This means it's no longer necessary to explicitly import these modules. Instead, these modules will be implicitly imported for the user. Private methods within the module are still accessible using the `builtin.module_name._private_method` import syntax. * New `math` package has been added to contain the `bit`, `math`, `numerics`, and `polynomial` modules. The contents of the `math.math` module are re-exported into the `math` package. * Mojo now supports using memory-only types in parameter expressions and as function or type parameters: ```mojo @value struct IntPair: var first: Int var second: Int fn add_them[value: IntPair]() -> Int: return value.first + value.second fn main(): print(add_them[IntPair(1, 2)]()) # prints '3' ``` * In addition, Mojo supports evaluating code that uses heap-allocated memory at compile-time and materializing compile-time values with heap-allocated memory into dynamic values: ```mojo fn fillVector(lowerBound: Int, upperBound: Int, step: Int) -> DynamicVector[Int]: var result = DynamicVector[Int]() for i in range(lowerBound, upperBound, step): result.push_back(i) return result fn main(): alias values = fillVector(5, 23, 7) for i in range(0, values.__len__()): print(values[i]) # prints '5', '12', and then '19' ``` #### 🦋 Changed * `def main():`, without the explicit `None` type, can now be used to define the entry point to a Mojo program. * The `assert_param` function has been renamed to `constrained` and is now a built-in function. * The `print` function now works on `Complex` values. #### 🛠️ Fixed * Fixed issues with print formatting for `DType.uint16` and `DType.int16`. * [Issue #499](https://github.com/modular/modular/issues/499) - Two new `rotate_right` and `rotate_left` functions have been added to the SIMD module. * [Issue #429](https://github.com/modular/modular/issues/429) - You can now construct a `Bool` from a `SIMD` type whose element-type is `DType.bool`. 
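  A minimal sketch of the now-working construction (hypothetical usage, in this release's syntax):

  ```mojo
  fn main():
      let flag = SIMD[DType.bool, 1](True)
      # Construct a Bool from a SIMD value whose element-type is DType.bool.
      let b = Bool(flag)
      print(b)
  ```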
* [Issue #350](https://github.com/modular/modular/issues/350) - Confusing Matrix implementation
* [Issue #349](https://github.com/modular/modular/issues/349) - Missing load_tr in struct Matrix
* [Issue #501](https://github.com/modular/modular/issues/501) - Missing syntax error messages in Python expressions.

### 2023-08-09

#### 🦋 Changed

* The `ref` and `mutref` identifiers are now treated as keywords, which means they cannot be used as variable, attribute, or function names. These keywords are used by the "lifetimes" feature, which is still in development. We can consider renaming these (as well as other related keywords) when the development work gels, support is enabled in public Mojo builds, and we have experience using them.

* The argument handling in `def` functions has changed: previously, they had special behavior that involved mutable copies in the callee. Now, we have a simple rule, which is that `def` arguments default to the `owned` convention (`fn` arguments still default to the `borrowed` convention).

  This change is mostly an internal cleanup and simplification of the compiler and argument model, but it does enable one niche use case: you can now pass non-copyable types to `def` arguments by transferring ownership of a value into the `def` call. Before, that would not be possible because the copy was made on the callee side, not the caller's side. This also allows the explicit use of the `borrowed` keyword with a `def` that wants to opt in to that behavior.

### 2023-08-03

#### ⭐️ New

* A new [`Tensor`](/max/api/mojo/tensor/tensor/Tensor) type has been introduced. This tensor type manages its own data (unlike `NDBuffer` and `Buffer`, which are just views). Therefore, the tensor type performs its own allocation and free. Here is a simple example of using the tensor type to represent an RGB image and convert it to grayscale:

  ```mojo
  from tensor import Tensor, TensorShape
  from utils.index import Index
  from random import rand

  let height = 256
  let width = 256
  let channels = 3

  # Create the tensor of dimensions height, width, channels
  # and fill with random values.
  let image = rand[DType.float32](height, width, channels)

  # Declare the grayscale image.
  var gray_scale_image = Tensor[DType.float32](height, width)

  # Perform the RGB to grayscale transform.
  for y in range(height):
      for x in range(width):
          let r = image[y, x, 0]
          let g = image[y, x, 1]
          let b = image[y, x, 2]
          gray_scale_image[Index(y, x)] = 0.299 * r + 0.587 * g + 0.114 * b
  ```

#### 🛠️ Fixed

* [Issue #53](https://github.com/modular/modular/issues/53) - `Int` now implements true division with the `/` operator. Similar to Python, this returns a 64-bit floating point number. The corresponding in-place operator, `/=`, has the same semantics as `//=`.

## July 2023

### 2023-07-26

#### ⭐️ New

* Types that define both `__getitem__` and `__setitem__` (i.e. where subscripting instances creates computed LValues) can now be indexed in parameter expressions.

* Unroll decorator for loops with constant bounds and steps:

  * `@unroll`: Fully unroll a loop.
  * `@unroll(n)`: Unroll a loop by a factor of `n`, where `n` is a positive integer.
  * The unroll decorator requires the loop bounds and iteration step to be compile-time constant values; otherwise, unrolling fails with a compilation error. It also doesn't make the loop induction variable a parameter.

  ```mojo
  # Fully unroll the loop.
  @unroll
  for i in range(5):
      print(i)

  # Unroll the loop by a factor of 4 (with remainder iterations of 2).
  @unroll(4)
  for i in range(10):
      print(i)
  ```

* The Mojo REPL now prints the values of variables defined in the REPL. There is full support for scalars and structs. Non-scalar SIMD vectors are not supported at this time.

#### 🛠️ Fixed

* [Issue #437](https://github.com/modular/modular/issues/437) - Range can now be instantiated with a PythonObject.
* [Issue #288](https://github.com/modular/modular/issues/288) - Python strings can now be safely copied.

### 2023-07-20

#### ⭐️ New

* Mojo now includes a `Limits` module, which contains functions to get the max and min values representable by a type, as requested in [Issue #51](https://github.com/modular/modular/issues/51). The following functions moved from `Math` to `Limits`: `inf()`, `neginf()`, `isinf()`, `isfinite()`.

* Mojo decorators are now distinguished between "signature" and "body" decorators, and they are ordered. Signature decorators, like `@register_passable` and `@parameter`, modify the type of the declaration before the body is parsed. Body decorators, like `@value`, modify the body of the declaration after it is fully parsed. Due to ordering, a signature decorator cannot be applied after a body decorator. That means the following is now invalid:

  ```mojo
  @register_passable  # error: cannot apply signature decorator after a body one!
  @value
  struct Foo:
      pass
  ```

* Global variables can now be exported in Mojo compiled archives, using the `@export` decorator. Exported global variables are public symbols in compiled archives and use the variable name as their linkage name by default. A custom linkage name can be specified with `@export("new_name")`. This does not affect variable names in Mojo code.

* Mojo now supports packages! A Mojo package is defined by placing an `__init__.mojo` or `__init__.🔥` within a directory. Other files in the same directory form modules within the package (this works exactly like it does [in Python](https://docs.python.org/3/tutorial/modules.html#packages)). Example:

  ```bash
  main.🔥
  my_package/
    __init__.🔥
    module.🔥
    my_other_package/
      __init__.🔥
      stuff.🔥
  ```

  ```mojo
  # main.🔥
  from my_package.module import some_function
  from my_package.my_other_package.stuff import SomeType

  fn main():
      var x: SomeType = some_function()
  ```

* Mojo now supports direct module and package imports! Modules and packages can be imported and bound to names. Module and package elements, like functions, types, global variables, and other modules, can be accessed using attribute references, like `my_module.foo`. Note that modules lack runtime representations, meaning module references cannot be instantiated.

  ```mojo
  import builtin.io as io
  import SIMD

  io.print("hello world")
  var x: SIMD.Float32 = 1.2
  ```

#### 🦋 Changed

* Reverted the feature from 2023-02-13 that allowed unqualified struct members. Use the `Self` keyword to conveniently access struct members with bound parameters instead. This was required to fix [Issue #260](https://github.com/modular/modular/issues/260).

* Updated the RayTracing notebook: added step 5 to create specular lighting for more realistic images and step 6 to add a background image.

#### 🛠️ Fixed

* [Issue #260](https://github.com/modular/modular/issues/260) - Definitions inside structs no longer shadow definitions outside of struct definitions.

### 2023-07-12

#### ⭐️ New

* Mojo now has support for global variables! This enables `var` and `let` declarations at the top-level scope in Mojo files.
Global variable initializers are run when code modules are loaded by the platform according to the order of dependencies between global variables, and their destructors are called in the reverse order. * The Mojo programming manual is now written as a Jupyter notebook, and available in its entirety in the Mojo Playground (`programming-manual.ipynb`). (Previously, `HelloMojo.ipynb` included most of the same material, but it was not up-to-date.) * As a result, we've also re-written `HelloMojo.ipynb` to be much shorter and provide a more gentle first-user experience. * [`Coroutine` module documentation](/mojo/stdlib/builtin/coroutine) is now available. Coroutines form the basis of Mojo's support for asynchronous execution. Calls to `async fn`s can be stored into a `Coroutine`, from which they can be resumed, awaited upon, and have their results retrieved upon completion. #### 🦋 Changed * `simd_bit_width` in the `TargetInfo` module has been renamed to `simdbitwidth` to better align with `simdwidthof`, `bitwidthof`, etc. #### 🛠️ Fixed * The walrus operator now works in if/while statements without parentheses, e.g. `if x := function():`. * [Issue #428](https://github.com/modular/modular/issues/428) - The `FloatLiteral` and `SIMD` types now support conversion to `Int` via the `to_int` or `__int__` method calls. The behavior matches that of Python, which rounds towards zero. ### 2023-07-05 #### ⭐️ New * Tuple expressions now work without parentheses. For example, `a, b = b, a` works as you'd expect in Python. * Chained assignments (e.g. `a = b = 42`) and the walrus operator (e.g. `some_function(b := 17)`) are now supported. #### 🦋 Changed * The `simd_width` and `dtype_simd_width` functions in the [`TargetInfo`](/mojo/stdlib/sys/info) module have been renamed to `simdwidthof`. * The `dtype_` prefix has been dropped from `alignof`, `sizeof`, and `bitwidthof`. You can now use these functions (e.g. `alignof`) with any argument type, including `DType`. * The `inf`, `neginf`, `nan`, `isinf`, `isfinite`, and `isnan` functions were moved from the `Numerics` module to the [`Math`](/mojo/stdlib/math/math/) module, to better align with Python's library structure. #### 🛠️ Fixed * [Issue #253](https://github.com/modular/modular/issues/253) - Issue when accessing a struct member alias without providing parameters. * [Issue #404](https://github.com/modular/modular/issues/404) - The docs now use `snake_case` for variable names, which more closely conforms to Python's style. * [Issue #379](https://github.com/modular/modular/issues/379) - Tuple limitations have been addressed and multiple return values are now supported, even without parentheses. * [Issue #347](https://github.com/modular/modular/issues/347) - Tuples no longer require parentheses. * [Issue #320](https://github.com/modular/modular/issues/320) - Python objects are now traversable via `for` loops. ## June 2023 ### 2023-06-29 #### ⭐️ New * You can now share `.ipynb` notebook files in Mojo Playground. Just save a file in the `shared` directory, and then right-click the file and select **Copy Sharable link**. To open a shared notebook, you must already have access to Mojo Playground; when you open a shared notebook, click **Import** at the top of the notebook to save your own copy. For more details about this feature, see the instructions inside the `help` directory, in the Mojo Playground file browser. 
#### 🦋 Changed

* The `unroll2()` and `unroll3()` functions in the [`Functional`](/mojo/stdlib/algorithm/functional) module have been renamed to overload the `unroll()` function. These functions unroll 2D and 3D loops, and `unroll()` can determine the intent based on the number of input parameters.

#### 🛠️ Fixed

* [Issue #229](https://github.com/modular/modular/issues/229) - Issue when throwing an exception from `__init__` before all fields are initialized.
* [Issue #74](https://github.com/modular/modular/issues/74) - Struct definition with recursive reference crashes.
* [Issue #285](https://github.com/modular/modular/issues/285) - The [`TargetInfo`](/mojo/stdlib/sys/info) module now includes `is_little_endian()` and `is_big_endian()` to check if the target host uses either little or big endian.
* [Issue #254](https://github.com/modular/modular/issues/254) - Parameter name shadowing in nested scopes is now handled correctly.

### 2023-06-21

#### ⭐️ New

* Added support for overloading on parameter signature. For example, it is now possible to write the following:

  ```mojo
  fn foo[a: Int](x: Int):
      pass

  fn foo[a: Int, b: Int](x: Int):
      pass
  ```

  For details on the overload resolution logic, see the Mojo Manual section on [parameters](/mojo/manual/parameters/#overloading-on-parameters).

* A new `cost_of()` function has been added to `Autotune`. This meta-function must be invoked at compile time, and it returns the number of MLIR operations in a function (at a certain stage in compilation), which can be used to build basic heuristics in higher-order generators.

  ```mojo
  from autotune import cost_of

  fn generator[f: fn(Int) -> Int]() -> Int:
      @parameter
      if cost_of[fn(Int) -> Int, f]() < 10:
          ...
  ```

#### 🛠️ Fixed

* Overloads of `@adaptive` functions that differ only in parameter names are now handled correctly, for example:

  ```mojo
  @adaptive
  fn foobar[w: Int, T: DType]() -> SIMD[T, w]: ...

  @adaptive
  fn foobar[w: Int, S: DType]() -> SIMD[S, w]: ...
  ```

* [Issue #219](https://github.com/modular/modular/issues/219) - Issue when redefining a function and a struct defined in the same cell.
* [Issue #355](https://github.com/modular/modular/issues/355) - The loop order in the Matmul notebook for Python and naive Mojo has been reordered for consistency. The loop order now follows (M, K, N) ordering.
* [Issue #309](https://github.com/modular/modular/issues/309) - Use snake case naming within the testing package and move the asserts out of the TestSuite struct.

### 2023-06-14

#### ⭐️ New

* Tuple type syntax is now supported, e.g. the following works:

  ```mojo
  fn return_tuple() -> (Int, Int):
      return (1, 2)
  ```

#### 🦋 Changed

* The `TupleLiteral` type was renamed to just `Tuple`, e.g. `Tuple[Int, Float]`.

#### 🛠️ Fixed

* [Issue #354](https://github.com/modular/modular/issues/354) - Returning a tuple doesn't work even with parens.
* [Issue #365](https://github.com/modular/modular/issues/365) - Copy-paste error in `FloatLiteral` docs.
* [Issue #357](https://github.com/modular/modular/issues/357) - Crash when missing input parameter to variadic parameter struct member function.

### 2023-06-07

#### ⭐️ New

* Tuple syntax now works on the left-hand side of assignments (in "lvalue" positions), enabling things like `(a, b) = (b, a)`. There are several caveats: the element types must exactly match (no implicit conversions); this only works with values of `TupleLiteral` type (notably, it will not work with `PythonObject` yet); and parentheses are required for tuple syntax.

#### ❌ Removed

* Mojo Playground no longer includes the following Python packages (due to size, compute costs, and [environment complications](https://github.com/modular/modular/issues/300)): `torch`, `tensorflow`, `keras`, `transformers`.
#### 🦋 Changed

* The data types and scalar names now conform to the naming convention used by numpy. So we use `Int32` instead of `SI32`, and similarly `Float32` instead of `F32`. Closes [Issue #152](https://github.com/modular/modular/issues/152).

#### 🛠️ Fixed

* [Issue #287](https://github.com/modular/modular/issues/287) - Computed lvalues don't handle raising functions correctly
* [Issue #318](https://github.com/modular/modular/issues/318) - Large integers are not being printed correctly
* [Issue #326](https://github.com/modular/modular/issues/326) - Float modulo operator is not working as expected
* [Issue #282](https://github.com/modular/modular/issues/282) - Default arguments are not working as expected
* [Issue #271](https://github.com/modular/modular/issues/271) - Confusing error message when converting between function types with different result semantics

## May 2023

### 2023-05-31

#### ⭐️ New

* Mojo Playground now includes the following Python packages (in response to [popular demand](https://github.com/modular/modular/discussions/173)): `torch`, `tensorflow`, `polars`, `opencv-python`, `keras`, `Pillow`, `plotly`, `seaborn`, `sympy`, `transformers`.

* A new optimization is applied to non-trivial copyable values that are passed as an owned value without using the transfer (`^`) operator. Consider code like this:

  ```mojo
  var someValue: T = ...
  ...
  takeValueAsOwned(someValue)
  ...
  ```

  When `takeValueAsOwned()` takes its argument as an [`owned`](/mojo/manual/values/ownership#transfer-arguments-owned-and-) value (this is common in initializers, for example), it is allowed to do whatever it wants with the value and destroy it when it is finished. In order to support this, the Mojo compiler is forced to make a temporary copy of the `someValue` value, and pass that value instead of `someValue`, because there may be other uses of `someValue` after the call.

  The Mojo compiler is now smart enough to detect when there are no uses of `someValue` later, and it will elide the copy just as if you had manually specified the transfer operator like `takeValueAsOwned(someValue^)`. This provides a nice "it just works" behavior for non-trivial types without requiring manual management of transfers.

  If you'd like to take full control and expose full ownership for your type, just don't make it copyable. Move-only types require the explicit transfer operator so you can see in your code where all ownership transfers happen.

* Similarly, the Mojo compiler now transforms calls to `__copyinit__` methods into calls to `__moveinit__` when that is the last use of the source value along a control flow path. This allows types which are both copyable and movable to get transparent move optimization. For example, the following code is compiled into moves instead of copies even without the use of the transfer operator:

  ```mojo
  var someValue = somethingCopyableAndMovable()
  use(someValue)
  ...
  let otherValue = someValue   # Last use of someValue
  use(otherValue)
  ...
  var yetAnother = otherValue  # Last use of otherValue
  mutate(yetAnother)
  ```

  This is a significant performance optimization for things like `PythonObject` (and more complex value semantic types) that are commonly used in a fluid programming style. These types don't want extraneous reference-counting operations performed by their copy constructors.
  If you want explicit control over copying, it is recommended to use a non-dunder `.copy()` method instead of `__copyinit__`; and recall that non-copyable types must always use the transfer operator, for fully explicit behavior.

#### 🛠️ Fixed

* [Issue #231](https://github.com/modular/modular/issues/231) - Unexpected error when a Python expression raises an exception
* [Issue #119](https://github.com/modular/modular/issues/119) - The REPL fails when a Python variable is redefined

### 2023-05-24

#### ⭐️ New

* `finally` clauses are now supported on `try` statements. In addition, `try` statements no longer require `except` clauses, allowing `try-finally` blocks. `finally` clauses contain code that is always executed when control flow leaves any of the other clauses of a `try` statement by any means.

#### 🦋 Changed

* `with` statement emission changed to use the new `finally` logic so that

  ```mojo
  with ContextMgr():
      return
  ```

  will correctly execute `ContextMgr.__exit__` before returning.

#### 🛠️ Fixed

* [Issue #204](https://github.com/modular/modular/issues/204) - Mojo REPL crash when returning a String at compile-time
* [Issue #143](https://github.com/modular/modular/issues/143) - synthesized init in `@register_passable` type doesn't get correct convention.
* [Issue #201](https://github.com/modular/modular/issues/201) - String literal concatenation is too eager.
* [Issue #209](https://github.com/modular/modular/issues/209) - [QoI] Terrible error message trying to convert a type to itself.
* [Issue #32](https://github.com/modular/modular/issues/32) - Include struct fields in docgen
* [Issue #50](https://github.com/modular/modular/issues/50) - Int to string conversion crashes due to buffer overflow
* [Issue #132](https://github.com/modular/modular/issues/132) - PythonObject `to_int` method has a misleading name
* [Issue #189](https://github.com/modular/modular/issues/189) - PythonObject bool conversion is incorrect
* [Issue #65](https://github.com/modular/modular/issues/65) - Add SIMD constructor from Bool
* [Issue #153](https://github.com/modular/modular/issues/153) - Meaning of `Time.now` function result is unclear
* [Issue #165](https://github.com/modular/modular/issues/165) - Typo in `Pointer.free` documentation
* [Issue #210](https://github.com/modular/modular/issues/210) - Parameter results cannot be declared outside top-level in function
* [Issue #214](https://github.com/modular/modular/issues/214) - Pointer offset calculations at compile-time are incorrect
* [Issue #115](https://github.com/modular/modular/issues/115) - Float printing does not include the right number of digits
* [Issue #202](https://github.com/modular/modular/issues/202) - `kgen.unreachable` inside nested functions is illegal
* [Issue #235](https://github.com/modular/modular/issues/235) - Crash when register passable struct field is not register passable
* [Issue #237](https://github.com/modular/modular/issues/237) - Parameter closure sharp edges are not documented

### 2023-05-16

#### ⭐️ New

* Added missing dunder methods to `PythonObject`, enabling the use of common arithmetic and logical operators on imported Python values.

* `PythonObject` is now printable from Mojo, instead of requiring you to import Python's print function.

#### 🛠️ Fixed

* [Issue #98](https://github.com/modular/modular/issues/98): Incorrect error with lifetime tracking in loop.
* [Issue #49](https://github.com/modular/modular/issues/49): Type inference issue (?) in 'ternary assignment' operation (FloatLiteral vs. `SIMD[f32, 1]`).
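  A case like the following is what this fix addresses (a sketch in the spelling of that time; names approximate):

  ```mojo
  let cond = True
  # One branch is a FloatLiteral, the other a SIMD[f32, 1]; inference
  # previously got confused between the two.
  let x: SIMD[f32, 1] = 1.5 if cond else SIMD[f32, 1](2.5)
  ```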
* [Issue #48](https://github.com/modular/modular/issues/48): and/or don't work with memory-only types.
* [Issue #11](https://github.com/modular/modular/issues/11): `setitem` Support for `PythonObject`.

### 2023-05-11

#### ⭐️ New

* `NDBuffer` and `Buffer` are now constructable via `Pointer` and `DTypePointer`.

* `String` now supports indexing with either integers or slices.

* Added factorial function to the `Math` module.

#### 🦋 Changed

* The "byref" syntax with the `&` sigil has changed to use an `inout` keyword to be more similar to the `borrowed` and `owned` syntax in arguments. Please see [Issue #7](https://github.com/modular/modular/issues/7) for more information.

* Optimized the Matrix multiplication implementation in the notebook. Initially we were optimizing for expandability rather than performance. We have found a way to get the best of both worlds, and now the performance of the optimized Matmul implementation is 3x faster.

* Renamed the [`^` postfix operator](/mojo/manual/values/ownership#transfer-arguments-owned-and-) from "consume" to "transfer."

#### 🛠️ Fixed

* Fixed missing overloads for `Testing.assertEqual` so that they work on `Integer` and `String` values.
* [Issue #6](https://github.com/modular/modular/issues/6): Playground stops evaluating cells when a simple generic is defined.
* [Issue #18](https://github.com/modular/modular/issues/18): Memory leak in Python interoperability was removed.

### 2023-05-02

#### 📢 Released

* Mojo publicly launched! This was epic, with lots of great coverage online, including a [wonderful post by Jeremy Howard](https://www.fast.ai/posts/2023-05-03-mojo-launch.html). The team is busy this week.

#### ⭐️ New

* Added a Base64 encoding function to perform base64 encoding on strings.

#### 🦋 Changed

* Decreased memory usage of serialization of integers to strings.
* Sped up the sort function.

#### 🛠️ Fixed

* Fixed time unit in the `sleep` function.

## April 2023

### Week of 2023-04-24

* 📢 The default behavior of nested functions has been changed. Mojo nested functions that capture are, by default, non-parametric runtime closures, meaning that:

  ```mojo
  def foo(x):
      # This:
      def bar(y):
          return x * y

      # Is the same as:
      let bar = lambda y: x * y
  ```

  These closures cannot have input or result parameters, because they are always materialized as runtime values. Values captured in the closure (`x` in the above example) are captured by copy: values without copy constructors cannot be copied, and captures are immutable in the closure.

  Nested functions that don't capture anything are by default "parametric" closures: they can have parameters and they can be used as parameter values. To restore the previous behavior for capturing closures, "parametric, capture-by-unsafe-reference closures", tag the nested function with the `@parameter` decorator.

* 📢 Mojo now has full support for "runtime" closures: nested functions that capture state materialized as runtime values. This includes taking the address of functions, indirect calls, and passing closures around through function arguments. Note that capture-by-reference is still unsafe!

  You can also take references to member functions with instances of that class using `foo.member_function`, which creates a closure with `foo` bound to the `self` argument.

* 📢 Mojo now supports Python style `with` statements and context managers. These are very helpful for implementing things like our trace region support and runtime support.
  A context manager in Mojo implements three methods:

  ```mojo
  fn __enter__(self) -> T:
  fn __exit__(self):
  fn __exit__(self, err: Error) -> Bool:
  ```

  The first is invoked when the context is entered, and returns a value that may optionally be bound to a target for use in the `with` body. If the `with` block exits normally, the second method is invoked to clean it up. If an error is raised, the third method is invoked with the `Error` value. If that method returns true, the error is considered handled; if it returns false, the error is re-thrown so propagation continues out of the `with` block.

* 📢 Mojo functions now support variable scopes! Explicit `var` and `let` declarations inside functions can shadow declarations from higher "scopes", where a scope is defined as any new indentation block. In addition, the `for` loop iteration variable is now scoped to the loop body, so it is finally possible to write:

  ```mojo
  for i in range(1):
      pass
  for i in range(2):
      pass
  ```

* 📢 Mojo now supports an `@value` decorator on structs to reduce boilerplate and encourage best practices in value semantics. The `@value` decorator looks to see whether the struct has a fieldwise initializer (which has arguments for each field of the struct), a `__copyinit__` method, and a `__moveinit__` method, and synthesizes the missing ones if possible. For example, if you write:

  ```mojo
  @value
  struct MyPet:
      var name: String
      var age: Int
  ```

  The `@value` decorator will synthesize the following members for you:

  ```mojo
  fn __init__(inout self, owned name: String, age: Int):
      self.name = name^
      self.age = age

  fn __copyinit__(inout self, existing: Self):
      self.name = existing.name
      self.age = existing.age

  fn __moveinit__(inout self, owned existing: Self):
      self.name = existing.name^
      self.age = existing.age
  ```

  This decorator can greatly reduce the boilerplate needed to define common aggregates, and gives you best practices in ownership management automatically. The `@value` decorator can be used with types that need custom copy constructors (your definition wins). We can explore having the decorator take arguments to further customize its behavior in the future.

* 📚 `memcpy` and `memcmp` now consistently use `count` as the byte count.

* 📚 Add a variadic string join on strings.

* 📚 Introduce a `reduce_bit_count` method to count the number of 1 bits across all elements in a SIMD vector.

* 📚 Optimize the `pow` function if the exponent is integral.

* 📚 Add a `len` function which dispatches to `__len__` across the different structs that support it.

### Week of 2023-04-17

* 📢 Error messages have been significantly improved, thanks to prettier printing for Mojo types in diagnostics.

* 📢 Variadic values can now be indexed directly without wrapping them in a `VariadicList`!

* 📢 `let` declarations in a function can now be lazily initialized, and `var` declarations that are never mutated get a warning suggesting they be converted to a `let` declaration. Lazy initialization allows more flexible patterns of initialization than requiring the initializer be inline, e.g.:

  ```mojo
  let x: Int
  if cond:
      x = foo()
  else:
      x = bar()
  use(x)
  ```

* 📢 Functions defined with `def` now return `object` by default, instead of `None`. This means you can return values (convertible to `object`) inside `def` functions without specifying a return type.

* 📢 The `@raises` decorator has been removed. Raising `fn`s should be declared by specifying `raises` after the function argument list. The rationale is that `raises` is part of the type system, instead of a function modifier.
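  For example, the new spelling looks like this (a short sketch; the function and error message are hypothetical):

  ```mojo
  # Previously: @raises fn might_fail(x: Int) -> Int:
  fn might_fail(x: Int) raises -> Int:
      if x < 0:
          raise Error("negative input")
      return x
  ```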
* 📢 The `BoolLiteral` type has been removed. Mojo now emits `True` and `False` directly as `Bool`. * 📢 Syntax for function types has been added. You can now write function types with `fn(Int) -> String` or `async def(&String, *Int) -> None`. No more writing `!kgen.signature` types by hand! * 📢 Float literals are now emitted as `FloatLiteral` instead of an MLIR `f64` type! * 📢 Automatic destructors are now supported by Mojo types, currently spelled `fn __del___(owned self):` (the extra underscore will be dropped shortly). These destructors work like Python object destructors and similar to C++ destructors, with the major difference being that they run "as soon as possible" after the last use of a value. This means they are not suitable for use in C++-style RAII patterns (use the `with` statement for that, which is currently unsupported). These should be generally reliable for both memory-only and register-passable types, with the caveat that closures are known to *not* capture values correctly. Be very careful with interesting types in the vicinity of a closure! * A new (extremely dangerous!) builtin function is available for low-level ownership muckery. The `__get_address_as_owned_value(x)` builtin takes a low-level address value (of `!kgen.pointer` type) and returns an `owned` value for the memory that is pointed to. This value is assumed live at the invocation of the builtin, but is "owned" so it needs to be consumed by the caller, otherwise it will be automatically destroyed. This is an effective way to do a "placement delete" on a pointer. ```mojo # "Placement delete": destroy the initialized object being pointed to. _ = __get_address_as_owned_value(somePointer.value) # Result value can be consumed by anything that takes it as an 'owned' # argument as well. consume(__get_address_as_owned_value(somePointer.value)) ``` * Another magic operator, named `__get_address_as_uninit_lvalue(x)`, joins the magic LValue operator family. This operator projects a pointer to an LValue like `__get_address_as_lvalue(x)`. The difference is that `__get_address_as_uninit_lvalue(x)` tells the compiler that the pointee is uninitialized on entry and initialized on exit, which means that you can use it as a "placement new" in the C++ sense. `__get_address_as_lvalue(x)` tells the compiler that the pointee is initialized already, so reassigning over it will run the destructor. ```mojo # "*Re*placement new": destroy the existing SomeHeavy value in the memory, # then initialize a new value into the slot. __get_address_as_lvalue(somePointer.value) = SomeHeavy(4, 5) # Ok to use an lvalue, convert to borrow etc. use(__get_address_as_lvalue(somePointer.value)) # "Placement new": Initialize a new value into uninitialized memory. __get_address_as_uninit_lvalue(somePointer.value) = SomeHeavy(4, 5) # Error, cannot read from uninitialized memory. use(__get_address_as_uninit_lvalue(somePointer.value)) ``` Note that `__get_address_as_lvalue` assumes that there is already a value at the specified address, so the assignment above will run the `SomeHeavy` destructor (if any) before reassigning over the value. * 📢 Implement full support for `__moveinit__` (aka move constructors). This implements the ability for memory-only types to define two different types of move ctors if they'd like: 1. `fn __moveinit__(inout self, owned existing: Self)`: Traditional Rust-style move constructors that shuffle data around while taking ownership of the source binding. 2.
`fn __moveinit__(inout self, inout existing: Self):`: C++-style "stealing" move constructors that can be used to take from an arbitrary LValue. This gives us great expressive capability (better than Rust/C++/Swift) and composes naturally into our lifetime tracking and value categorization system. * The `__call__` method of a callable type has been relaxed to take `self` by borrow, allowing non-copyable callees to be called. * Implicit conversions are now invoked in `raise` statements properly, allowing strings to be converted to the `Error` type. * Automatic destructors are turned on for `__del__` instead of `__del___`. * 📚 Add the builtin FloatLiteral type. * 📚 Add integral `floordiv` and `mod` for the SIMD type that handle negative values. * 📚 Add an F64 to String converter. * 📚 Make the `print` function take variadic inputs. ### Week of 2023-04-10 * 📢 Introduce the consume operator `x^`. This introduces the postfix consume operator, which produces an RValue given a lifetime-tracked object (and, someday, a movable LValue). * Mojo now automatically synthesizes empty destructor methods for certain types when needed. * The `object` type has been built out into a fully-dynamic type, with dynamic function dispatch, with full error handling support. ```mojo def foo(a) -> object: return (a + 3.45) ``` * The `T{}` initializer syntax has been removed for memory-primary types. * Mojo String literals now emit a builtin `StringLiteral` type! One less MLIR type to worry about. * New `__getattr__` and `__setattr__` dunder methods were added. Mojo calls these methods on a type when attempting member lookup of a non-static member. This allows writing dynamic objects like `x.foo()` where `foo` is not a member of `x`. * Early destructor support has been added. Types can now define a special destructor method `__del___` (note three underscores). This is an early feature and it is still being built out. There are many caveats, bugs, and missing pieces. Stay tuned! * 📚 Integer division and mod have been corrected for rounding in the presence of negative numbers. * 📚 Add scalar types (UI8, SI32, F32, F64, etc.) which are aliases to `SIMD[1, type]`. ## March 2023 ### Week of 2023-03-27 * 📢 Parameter names are no longer load-bearing in function signatures. This gives more flexibility in defining higher-order functions, because the functions passed as parameters do not need their parameter names to match. ```mojo # Define a higher-order function... fn generator[ func: __mlir_type[`!kgen.signature() -> !kgen.none`] ](): pass # Int parameter is named "foo". fn f0[foo: Int](): pass # Int parameter is named "bar". fn f1[bar: Int](): pass fn main(): # Both can be used as `func`! generator[f0]() generator[f1]() ``` Stay tuned for improved function type syntax... * 📢 Two magic operators, named `__get_lvalue_as_address(x)` and `__get_address_as_lvalue`, convert stored LValues to and from `!kgen.pointer` types (respectively). This is most useful when using the `Pointer[T]` library type. The `Pointer(to=lvalue)` method uses the first one internally. The second one must currently be used explicitly, and can be used to project a pointer to a reference that you can pass around and use as a self value, for example: ```mojo # "Replacement new" SomeHeavy value into the memory pointed to by a # Pointer[SomeHeavy].
__get_address_as_lvalue(somePointer.value) = SomeHeavy(4, 5) ``` Note that `__get_address_as_lvalue` assumes that there is already a value at the specified address, so the assignment above will run the `SomeHeavy` destructor (if any) before reassigning over the value. * The `(((x)))` syntax in `__mlir_op` has been removed in favor of `__get_lvalue_as_address`, which solves the same problem and is more general. * 📢 When using a mutable `self` argument to a struct `__init__` method, it now must be declared with `&`, like any other mutable method. This clarifies the mutation model by making `__init__` consistent with other mutating methods. * 📚 Add a variadic string join function. * 📚 Default initialize values with 0 or null if possible. * 📚 Add compressed, aligned, and mask store intrinsics. ### Week of 2023-03-20 * An initial `String` type is added to the standard library with some very basic methods. * Add `DimList` to remove the need to use an MLIR list type throughout the standard library. * 📢 The `__clone__` method for copying a value is now named `__copy__` to better follow the Python term of art. * 📢 The `__copy__` method now takes its self argument as a "read" value, instead of taking it by reference. This makes it easier to write, works for `@register_passable` types, and exposes more optimization opportunities to the early optimizer and dataflow analysis passes. ```mojo # Before: fn __clone__(inout self) -> Self: ... # After: fn __copy__(self) -> Self: ... ``` * 📢 A new `@register_passable("trivial")` decorator may be applied to structs that have no need for a custom `__copy__` or `__del__` method, and whose state is only made up of `@register_passable("trivial")` types. This eliminates the need to define `__copy__` boilerplate and reduces the amount of IR generated by the compiler for trivial types like `Int`. * You can now write back to attributes of structs that are produced by a computed lvalue expression. For example `a[i].x = ..` works when `a[i]` is produced with a `__getitem__`/`__setitem__` call. This is implemented by performing a read of `a[i]`, updating the temporary, then doing a writeback. * The remaining hurdles to using non-parametric, `@register_passable` types as parameter values have been cleared. Types like `Int` should enjoy full use as parameter values. * Parameter pack inference has been added to function calls. Calls to functions with parameter packs can now elide the pack types: ```mojo fn foo[*Ts: AnyType](*args: *Ts): pass foo(1, 1.2, True, "hello") ``` Note that the syntax for parameter packs has been changed as well. * 📚 Add the runtime string type. * 📚 Introduce the DimList struct to remove the need to use low-level MLIR operations. ### Week of 2023-03-13 * 📢 Initializers for structs now use `__init__` instead of `__new__`, following standard practice in Python. You can write them in one of two styles, either traditional where you mutate self: ```mojo fn __init__(self, x: Int): self.x = x ``` or as a function that returns an instance: ```mojo fn __init__(x: Int) -> Self: return Self {x: x} ``` Note that `@register_passable` types must use the latter style. * 📢 The default argument convention is now the `borrowed` convention. A "read" argument is passed like a C++ `const&` so it doesn't need to invoke the copy constructor (aka the `__clone__` method) when passing a value to the function. There are two differences from C++ `const&`: 1. A future borrow checker will make sure there are no mutable aliases with an immutable borrow. 2.
`@register_passable` values are passed directly in an SSA register (and thus, usually in a machine register) instead of using an extra reference wrapper. This is more efficient and is the 'right default' for `@register_passable` values like integers and pointers. This also paves the way to remove the reference requirement from `__clone__` method arguments, which will allow us to fill in more support for them. * Support for variadic pack arguments has been added to Mojo. You can now write heterogeneous variadic packs like: ```mojo fn foo[*Ts: AnyType](*args: Ts): pass foo[Int, F32, String, Bool](1, 1.5, "hello", True) ``` * The `owned` argument convention has been added. This argument convention indicates that the function takes ownership of the argument and is responsible for managing its lifetime. * The `borrowed` argument convention has been added. This convention signifies that the callee gets an immutable shared reference to a value in the caller's context. * 📚 Add the `getenv` function to the `OS` module to enable getting environment variables. * 📚 Enable the use of dynamic strides in `NDBuffer`. ### Week of 2023-03-06 * 📢 Support added for using capturing async functions as parameters. * 📢 Returning result parameters has been moved from `return` statements to a new `param_return` statement. This allows returning result parameters from throwing functions: ```mojo @raises fn foo[() -> out: Int](): param_return[42] raise Error() ``` And returning different parameters along `@parameter if` branches: ```mojo fn bar[in: Bool -> out: Int](): @parameter if in: param_return[1] else: param_return[2] ``` * 📢 Mojo now supports omitting returns at the end of functions when they would not be reachable. For instance: ```mojo fn foo(cond: Bool) -> Int: if cond: return 0 else: return 1 fn bar() -> Int: while True: pass ``` * String literals now support concatenation, so `"hello " "world"` is treated the same as `"hello world"`. * Empty bodies on functions, structs, and control flow statements are no longer allowed. Please use `pass` in them to explicitly mark that they are empty, just like in Python. * 📢 Structs in Mojo now default to living in memory instead of being passed around in registers. This is the right default for generality (large structures, structures whose pointer identity matters, etc.) and is a key technology that enables the borrow model. Simple types like `Int` and `SIMD` can be marked as `@register_passable`. Note that memory-only types currently have some limitations: they cannot be used in generic algorithms that take and return a `!mlirtype` argument, and they cannot be used in parameter expressions. Because of this, a lot of types have to be marked `@register_passable` just to work around the limitations. We expect to enable these use cases over time. * 📢 Mojo now supports computed lvalues, which means you can finally assign to subscript expressions instead of having to call `__setitem__` explicitly. Some details on this: Mojo allows you to define multiple `__setitem__` overloads, but will pick the one that matches your `__getitem__` type if present. It allows you to pass computed lvalues into inout arguments by introducing a temporary copy of the value in question. * Mojo now has much better support for using register-primary struct types in parameter expressions and as the types of parameter values. This will allow migration of many standard library types away from using bare MLIR types like `__mlir_type.index` and towards using `Int`.
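As an illustration of where this is heading, here is a small hypothetical sketch (in the syntax of the time; the names are illustrative, not from the changelog) of `Int` used directly as a parameter type:

```mojo
fn scale_by[factor: Int](x: Int) -> Int:
    # `factor` is a compile-time parameter typed with the library `Int`
    # struct rather than the bare `__mlir_type.index`.
    return x * factor

fn use_it():
    let y = scale_by[3](14)  # parameter bound at compile time; y == 42
```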
That migration moves us towards getting rid of MLIR types everywhere and makes struct types first-class citizens in the parameter system. * 📚 Add a `sort` function. * 📚 Add non-temporal store to enable cache bypass. ## February 2023 ### Week of 2023-02-27 * 📢 The `@interface`, `@implements`, and `@evaluator` trio of decorators have been removed, replaced by the `@parameter if` and `@adaptive` features. * 📢 Parameter inference can now infer the type of variadic lists. * 📢 Memory-primary types are now supported in function results. A result slot is allocated in the caller, and the callee writes the result of the function into that slot. This is more efficient for large types that don't fit into registers neatly! And initializers for memory-primary types now initialize the value in-place, instead of emitting a copy! * Support for `let` decls of memory-primary types has been implemented. These are constant, read-only values of memory-primary types, but which are allocated on the function stack. * Overload conversion resolution and parameter inference have been improved: 1. Inference now works with `let` decls in some scenarios that weren't working before. 2. Parameter bindings can now infer types into parameter expressions. This helps resolve higher-order functions in parameter expressions. * 📚 Optimize floor, ceil, and ldexp on X86 hardware. * 📚 Implement the log math function. ### Week of 2023-02-20 * 📢 A new `@__memory_primary` struct decorator has been introduced. Memory-primary types must always have an address. For instance, they are always stack-allocated when declared in a function and their values are passed into function calls by address instead of copy. This is in contrast with register-primary types, which may not have an address, and which are passed by value in function calls. Memory-primary fields are not allowed inside register-primary structs, because struct elements are stored in-line. * 📢 A new `_CompilerBuiltin` module was added. This module defines core types and functions of the language that are referenced by the parser, and hence, is auto-imported by all other modules. For example, new types for literal values like the boolean True/False will be included in `_CompilerBuiltin`. * 📢 A special `__adaptive_set` property can be accessed on a function reference marked as `@adaptive`. The property returns the adaptive overload set of that function. The return type is a `!kgen.variadic`. This feature is useful to implement a generic `evaluate` function in the standard library. * 📢 A new built-in literal type `BoolLiteral` was added in `_CompilerBuiltin`. It represents the literal boolean values `True` and `False`. This is the first Mojo literal to be emitted as a standard library type! * 📚 Add the prefetch intrinsic to enable HW prefetching of a cache line. * 📚 Add the `InlinedFixedVector`, which is optimized for small vectors and stores values on both the stack and the heap. ### Week of 2023-02-13 * Unqualified lookups of struct members apply contextual parameters. This means, for instance, that you can refer to static methods without binding the struct parameters. ```mojo struct Foo[x: Int]: @staticmethod fn bar(): pass fn foo(self): bar() # implicitly binds to Foo[x].bar() Foo[2].bar() # explicitly bind to another parameter ``` * 📢 A new `Self` type refers to the enclosing type with all parameters bound to their current values.
This is useful when working with complex parametric types, e.g.: ```mojo struct MyArray[size: Int, element_type: type]: fn __new__() -> Self: return Self {...} ``` which is a lot nicer than having to say `MyArray[size, element_type]` over and over again. * 📢 Mojo now supports an `@adaptive` decorator. This decorator will supersede interfaces, and it represents an overloaded function that is allowed to resolve to multiple valid candidates. In that case, the call is emitted as a fork, resulting in multiple function candidates to search over. ```mojo @adaptive fn sort(arr: ArraySlice[Int]): bubble_sort(arr) @adaptive fn sort(arr: ArraySlice[Int]): merge_sort(arr) fn concat_and_sort(lhs: ArraySlice[Int], rhs: ArraySlice[Int]): let arr = lhs + rhs sort(arr) # this forks compilation, creating two instances # of the surrounding function ``` * 📢 Mojo now requires that types implement the `__clone__` special member in order to copy them. This allows the safe definition of non-copyable types like Atomic. Note that Mojo still doesn't implement destructors, and (due to the absence of non-mutable references) it doesn't actually invoke the `__clone__` member when copying a let value. As such, this forces you, as a Mojo user, to write maximal boilerplate without getting much value out of it. In the future, we will reduce the boilerplate with decorators, and we will actually start using it. This will take some time to build out though. * 📢 A special `__mlir_region` statement was added to provide stronger invariants around defining MLIR operation regions in Mojo. It has similar syntax to function declarations, except that there are no results and no input conventions. * 📚 Implement the log math function. * 📚 Improve the DType struct to enable compile-time equality checks. * 📚 Add the Complex struct class. ### Week of 2023-02-06 * 📢 The `if` statement now supports a `@parameter` decorator, which requires its condition to be a parameter expression, but which only emits the 'True' side of the condition to the binary, providing a "static if" functionality. This should eliminate many uses of `@interface` that are just used to provide different constraints on the implementations. * 📢 `fn main():` is now automatically exported and directly runnable by the command-line `mojo` tool. This is a stop-gap solution to enable script-like use cases until we have more of the language built out. * 🪦 The `@nodebug_inline` feature has been removed, please use `@alwaysinline("nodebug")` for methods that must be inlined and that we don't want to step into. * 📢 Python chained comparisons, ex. `a < b < c`, are now supported in Mojo. * 📢 Calling an `async fn` now produces a `Coroutine` value that can be awaited to retrieve the function's result: ```mojo async fn add_three(a: Int, b: Int, c: Int) -> Int: return a + b + c async fn call_it(): let task: Coroutine[Int] = add_three(1, 2, 3) print(await task) ``` * ⭐️ We now diagnose unused expression values at statement context in `fn` declarations (but not in `def`s). This catches bugs with unused values, e.g. when you forget the parens to call a function. * 📢 An `@always_inline("nodebug")` function decorator can be used on functions that need to be force-inlined but should not have debug info in the result. This should be used on methods like `Int.__add__` which should be treated as builtin. * 📢 The `@export` decorator now supports an explicit symbol name to export to, for example: ```mojo @export("baz") # exported as 'baz' fn some_mojo_fn_name(): ``` * 📢 🚧 Subscript syntax is now wired up to the `__getitem__` dunder method. This allows type authors to implement the `__getitem__` method to enable values to be subscripted.
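A hypothetical sketch of what this enables, using the `__new__`-style initializers of this era (the `Pair` type and its members are illustrative, not from the changelog):

```mojo
struct Pair:
    var first: Int
    var second: Int

    fn __new__(first: Int, second: Int) -> Pair:
        return Pair {first: first, second: second}

    fn __getitem__(self, i: Int) -> Int:
        # Called for subscript reads such as `p[0]`.
        if i == 0:
            return self.first
        return self.second

fn read_pair():
    let p = Pair(1, 2)
    let x = p[1]  # sugar for Pair.__getitem__(p, 1)
```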
This subscript support is an extended version of the Python semantics (given we support overloading) that allows you to define N indices instead of a single version that takes a tuple (also convenient because we don't have tuples yet). Note that this has a very, very important limitation: subscripts are NOT wired up to `__setitem__` yet. This means that you can read values with `.. = v[i]` but you cannot store to them with `v[i] = ..`. For this, please continue to call `__setitem__` directly. * 📢 Function calls support parameter inference. For calls to functions that have an insufficient number of parameters specified at the callsite, we can now infer them from the argument list. We do this by matching up the parallel type structure to infer what the parameters must be. Note that this works left to right in the parameter list, applying explicitly specified parameters before trying to infer new ones. This is similar to how C++ does things, which means that you may want to reorder the list of parameters with this in mind. For example, a `dyn_cast`-like function will be more elegant when implemented as `fn dyn_cast[DstType: type, SrcType: type](src: SrcType) -> DstType:` than with the `SrcType`/`DstType` parameters flipped around. * 📚 Add the growable Dynamic vector struct. ### Week of 2023-01-23 * In-place operations like `+=`/`__iadd__` may now take `self` by-val if they want to, instead of requiring it to be by-ref. * ⭐️ In-place operations are no longer allowed to return a non-None value. The corresponding syntax is a statement, not an expression. * A new `TaskGroup` type was added to the standard library. This type can be used to schedule multiple tasks on a multi-threaded workqueue to be executed in parallel. An async function can `await` all the tasks at once with the taskgroup. * 📢 We now support for loops! A type that defines an `__iter__` method that returns a type that defines `__next__` and `__len__` methods is eligible to be used in the statement `for el in X()`. Control flow exits the loop when the length is zero. This means things like this now work: ```mojo for item in range(start, end, step): print(item) ``` * Result parameters now have names. This is useful for referring to result parameters in the return types of a function: ```mojo fn return_simd[() -> nelts: Int]() -> SIMD[f32, nelts]: ``` * 📢 We now support homogeneous variadics in value argument lists, using the standard Python `fn thing(*args: Int):` syntax! Variadics also have support in parameter lists: ```mojo fn variadic_params_and_args[*a: Int](*b: Int): print(a[0]) print(b[1]) ``` * 📚 Add the range struct to enable `for ... range(...)` loops. * 📚 Introduce the unroll generator to allow one to unroll loops via a library function. ### Week of 2023-01-16 * 📢 Struct field references are now supported in parameter context, so you can use `someInt.value` to get the underlying MLIR thing out of it. This should allow using first-class types in parameters more widely. * 📢 We now support "pretty" initialization syntax for structs, e.g.: ```mojo struct Int: var value: __mlir_type.index fn __new__(value: __mlir_type.index) -> Int: return Int {value: value} ``` This eliminates the need to directly use the MLIR `lit.struct.create` op in struct initializers. This syntax may change in the future when ownership comes in, because we will be able to support the standard `__init__` model then. * 📢 It is now possible to attach regions to `__mlir_op` operations.
This is done with a hack that allows an optional `_region` attribute that lists references to the region bodies (max 1 region right now due to lack of a list `[]` literal). * Nested functions now parse, e.g.: ```mojo fn foo(): fn bar(): pass bar() ``` * Python-style `async` functions should now work and the `await` expression prefix is now supported. This provides the joy of async/await syntactic sugar when working with asynchronous functions. This is still somewhat dangerous to use because we don't have proper memory ownership support yet. * String literals are now supported. * Return processing is now handled by a dataflow pass inside the compiler, so it is possible to return early out of if statements. * The parser now supports generating 'fixit' hints on diagnostics, and uses them when an attribute dictionary uses an equal instead of a colon, e.g.: ```log x.mojo:8:48: error: expected ':' in subscript slice, not '=' return __mlir_op.`lit.struct.create`[value = 42]() ^ : ``` * 📚 Add reduction methods which operate on buffers. * 📚 Add more math functions like sigmoid, sqrt, rsqrt, etc. * 📚 Add partial load / store, which enable loads and stores that are predicated on a condition. ### Week of 2023-01-09 * The `/` and `*` markers in function signatures are now parsed and their invariants are checked. We do not support keyword arguments yet, though, so they aren't very useful. * Functions now support a new `@nodebug_inline` decorator. (Historical note: this was later replaced with `@alwaysinline("nodebug")`.) Many of the things at the bottom level of the Mojo stack are trivial zero-abstraction wrappers around MLIR things, for example, the `+` operator on Int or the `__bool__` method on Bool itself. These operators need to be force-inlined even at -O0, but they have some additional things that we need to wrestle with: 1. In no case would a user actually want to step into the `__bool__` method on Bool or the + method on Int. This would be terrible debugger QoI unless you're debugging Int itself. We need something like the `__always_inline__, __nodebug__` attributes that clang uses in headers like xmmintrin.h. 2. Similarly, these "operators" should be treated by users as primitives: they don't want to know about MLIR or internal implementation details of Int. 3. These trivial zero-abstraction things should be eliminated early in the compiler pipeline so they don't slow down the compiler, bloating out the call graph with trivial leaves. Such things slow down the elaborator, interfere with basic MLIR things like fold(), bloat out the IR, and bloat out generated debug info. 4. In a parameter context, we want some of these things to get inlined so they can be simplified by the attribute logic and play more nicely with canonical types. This is just a nice-to-have for those of us who have to stare at generated IR. The solution to this is a new `@nodebug_inline` decorator. This decorator causes the parser to force-inline the callee instead of generating a call to it. While doing so, it gives the operations the location of the call itself (that's the "nodebug" part) and strips out let decls that were part of the internal implementation details. This is a super-power-user feature intended for those building the standard library itself, so it is intentionally limited in power and scope: it can only be used on small functions, and it doesn't support regions, by-ref, throws, async, etc. * Separately, we now support an `@alwaysInline` decorator on functions.
This is a general decorator that works on any function, and indicates that the function must be inlined. Unlike `@nodebug_inline`, this kind of inlining is performed later in the compilation pipeline. * The `__include` hack has been removed now that we have proper import support. * `__mlir_op` can now get the address of an l-value: you can use the magic `(((x)))` syntax in `__mlir_op` that forces the `x` expression to be an lvalue, and yields its address. This provides an escape hatch (isolated off in `__mlir_op` land) that allows unsafe access to lvalue addresses. * We now support `__rlshift__` and `__rtruediv__`. * 📢 The parser now resolves scoped alias references. This allows us to support things like `SomeType.someAlias`, forward substituting the value. This unblocks use of aliases in types like `DType`. We'd like to eventually preserve the reference in the AST, but this unblocks library development. * 📚 Add a `now` function and `Benchmark` struct to enable timing and benchmarking. * 📚 Move more of the computation in NDBuffer from runtime to compile time if possible (e.g. when the dimensions are known at compile time). ### Week of 2023-01-02 * 📚 Added the `print` function which works on Integers and SIMD values. * The frontend now has a new diagnostic subsystem used by the `kgen` tool (but not by `kgen-translate` for tests) that supports source ranges on diagnostics. Before we'd emit an error like: ```log x.mojo:13:3: error: invalid call to 'callee': in argument #0, value of type '$F32::F32' cannot be converted to expected type '$int::Int' callee(1.0+F32(2.0)) ^ x.lit:4:1: note: function declared here fn callee(a: Int): ^ ``` now we produce: ```log x.mojo:13:3: error: invalid call to 'callee': in argument #0, value of type '$F32::F32' cannot be converted to expected type '$int::Int' callee(1.0+F32(2.0)) ^ ~~~~~~~~~~~~ x.lit:4:1: note: function declared here fn callee(a: Int): ^ ``` * 📢 Parameter results are now supported in a proper way. They are now forward declared with an alias declaration and then bound in a call with an arrow, e.g.: ```mojo alias a: __mlir_type.index alias b: __mlir_type.index idx_result_params[xyz * 2 -> a, b]() ``` * Various minor issues with implicit conversions are fixed. For instance, implicit conversions are now supported in parameter binding contexts and `alias` declarations with explicit types. * Doc strings are allowed on functions and structs, but they are currently discarded by the parser. * 📚 Add a `print` method!!! * 📚 Demonstrate a naive matmul in Mojo. * 📚 Initial work on functions that depend on types (e.g. FPUtils, nan, inf, etc.) * 📚 Allow one to query hardware properties such as `simd_width`, `os`, etc. via TargetInfo at compile time. ## December 2022 ### Week of 2022-12-26 * 📢 You can now call functions in a parameter context! Calling a function in a parameter context will evaluate the function at compile time. The result can then be used as a parameter value. For example: ```mojo fn fma(x: Int, y: Int, z: Int) -> Int: return x + y * z fn parameter_call(): alias nelts = fma(32, 2, 16) var x: SIMD[f32, nelts] ``` * You can now disable printing of types in an `__mlir_attr` substitution by using a unary `+` expression. * 📢 `let` declarations are now supported in functions. `let` declarations are local run-time constant values, which are always rvalues. They complement 'var' decls (which are mutable lvalues) and are the normal thing to use in most cases. They also generate less IR and are always in SSA form when initialized.
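An illustrative sketch of the distinction (in the syntax of the time; not from the original changelog):

```mojo
fn let_vs_var():
    let base = 10       # run-time constant: initialized once, always an rvalue
    var count = 0       # mutable lvalue
    count = count + base
    # base = 5          # error: `let` values cannot be reassigned
```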
We will want to extend this to support 'let' decls in structs at some point and support lazily initialized 'let' declarations (using dataflow analysis), but that isn't supported yet. * 📚 Add the NDBuffer struct. * Happy new year. ### Week of 2022-12-19 * 📚 Start of the standard library: 1. Added Integer and SIMD structs to bootstrap the standard library. 2. Added a very basic buffer data structure. * We have basic support for parsing parameter results in function calls! Result parameters are an important Mojo metaprogramming feature. They allow functions to return compile-time constants. ```mojo fn get_preferred_simdwidthof[() -> nelts: Int](): return[2] fn vectorized_function(): get_preferred_simdwidthof[() -> nelts]() var x: SIMD[f32, nelts] ``` * Types can now be used as parameters of `!kgen.mlirtype` in many more cases. * MLIR operations with zero results don't need to specify `_type: []` anymore. * We support parsing triple-quoted strings, for writing docstrings for your functions and structs! * A new `__mlir_type[a,b,c]` syntax is available for substituting into MLIR types and attributes, and the old placeholder approach is removed. This approach has a few advantages beyond what placeholders do: 1. It's simpler. 2. It doesn't form the intermediate result with placeholders, which gets rejected by MLIR's semantic analysis, e.g. the complex case couldn't be expressed before. 3. It provides a simple way to break long attrs/types across multiple lines. * We now support an `@evaluator` decorator on functions for KGEN evaluators. This enables specifying user-defined interface evaluators when performing search during compilation. * 📢 `import` syntax is now supported! This handles packaging imported modules into file ops, and enables effective isolation from other decls. "import" into the desired context is just aliasing decls, with the proper symbol references handled automatically during IR generation. As a starting point, this doesn't handle any notion of packages (as those haven't been sketched out enough). * 📢 Reversed binary operators (like `__radd__`) are now looked up and used if the forward version (like `__add__`) doesn't work for some reason. * 📢 Implicit conversions are now generally available, e.g. in assign statements, variable initializers, etc. There are probably a few more places they should work, but we can start eliminating all the extraneous explicit casts from literals now. * Happy Holidays. ### Week of 2022-12-12 * 📢 Function overloading now works. Call resolution filters the candidate list according to the actual parameters and value arguments specified at the call site, diagnosing an error if none of the candidates are viable or if multiple are viable and ambiguous. We also consider implicit conversions in overload lookup: ```mojo fn foo(x: Int): pass fn foo(x: F64): pass foo(Int(1)) # resolves to the first overload foo(1.0) # resolves to the second overload foo(1) # error: both candidates viable with 1 implicit conversion! ``` * The short-circuiting binary `and` and `or` expressions are now supported (see the sketch below). * Unary operator processing is a lot more robust, now handling the `not` expression and `~x` on Bool. * 📢 The compiler now generates debug information for use with GDB/LLDB that describes variables and functions. * The first version of the Mojo Visual Studio Code extension has been released! It supports syntax highlighting for Mojo files. * The first version of the `Bool` type has landed in the new Mojo standard library!
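Taken together with the new `Bool` type, here is an illustrative sketch of the short-circuiting and unary operators (not from the original changelog; the function names are hypothetical):

```mojo
fn safe_ratio_over_two(a: Int, b: Int) -> Bool:
    # `and` short-circuits: the division never runs when b == 0.
    return b != 0 and a // b > 2

fn check():
    var ok: Bool = safe_ratio_over_two(10, 0)  # False; right-hand side skipped
    var flipped: Bool = not ok                 # unary `not` on Bool
```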
* 📢 Implicit conversions are now supported in return statements. ### Week of 2022-12-05 * "Discard" patterns are now supported, e.g. `_ = foo()` * We now support implicit conversions in function call arguments, e.g. converting an `index` value to `Int` automatically. This eliminates a bunch of casts, e.g. the need to say F32(1.0) everywhere. This is limited for a few reasons that will be improved later: 1. We don't support overloading, so lots of types aren't convertible from all the things they should be, e.g. you can't pass "1" to something that expects F32, because F32 can't be created from index. 2. This doesn't "check to see if we can invoke `__new__`" it force applies it on a mismatch, which leads to poor QoI. 3. This doesn't fix things that need radd. ## November 2022 ### Week of 2022-11-28 * 📢 We support the `True` and `False` keywords as expressions. * 📢 A new `alias` declaration is supported which allows defining local parameter values. This will eventually subsume type aliases and other things as it gets built out. * 📢 We now have end-to-end execution of Mojo files using the `kgen` tool! Functions exported with `@export` can be executed. * 📢 We have try-except-else and `raise` statements and implicit error propagation! The error semantics are that `def` can raise by default, but `fn` must explicitly declare raising with a `@raises` decorator. Stub out basic `Error` type. * The `&` sigil for by-ref arguments is now specified after the identifier. Postfix works better for ref and move operators on the expression side because it chains an mentally associates correctly: `thing.method().result^`. We don't do that yet, but align param decl syntax to it so that things won't be odd looking when we do. In practice this looks like: ```mojo def mutate_argument(a&: index): a = 25 ``` ### Week of 2022-11-21 * 📢 The magic `index` type is gone. Long live `__mlir_type.index`. * Implement parameter substitution into parametric `__mlir_type` decls. This allows us to define parametric opaque MLIR types with exposed parameters using a new "placeholder" attribute. This allows us to expose the power of the KGEN type parametric system directly into Mojo. * 📢 Fully-parametric custom types can now be defined and work in Mojo, bringing together a lot of the recent work. We can write the SIMD type directly as a wrapper around the KGEN type, for example: ```mojo struct SIMD[dt: __mlir_type.`!kgen.dtype`, nelts: __mlir_type.index]: var value: __mlir_type.`!pop.simd, #lit>`[nelts, dt] fn __add__(self, rhs: SIMD[dt, nelts]) -> SIMD[dt, nelts]: return __mlir_op.`pop.add`(self.value, rhs.value) ``` ### Week of 2022-11-14 * 📢 Implement a magic `__mlir_type` declaration that can be used to access any MLIR type. E.g. `__mlir_type.f64`. * 📢 Add an `fn` declaration. These are like `def` declarations, but are more strict in a few ways: they require type annotations on arguments, don't allow implicit variable declarations in their body, and make their arguments rvalues instead of lvalues. * Implemented Swift-style backtick identifiers, which are useful for code migration where names may collide with new keywords. * 📢 A new `__include` directive has been added that performs source-level textual includes. This is temporary until we have an `import` model. * Implement IR generation for arithmetic operators like `+` and `*` in terms of the `__add__` and `__mul__` methods. * 📢 Added support for `break` and `continue` statements, as well as early returns inside loops and conditionals! 
* 📢 Implemented augmented assignment operators, like `+=` and `@=`. * 📢 Mojo now has access to generating any MLIR operations (without regions) with a new `__mlir_op` magic declaration. We can start to build out the language's builtin types with this: ```mojo struct Int: var value: __mlir_type.index fn __add__(self, rhs: Int) -> Int: return __mlir_op.`index.add`(self.value, rhs.value) ``` Attributes can be attached to the declaration with subscript `[]` syntax, and an explicit result type can be specified with a special `_type` attribute if it cannot be inferred. Attributes can be accessed via the `__mlir_attr` magic decl: ```mojo __mlir_op.`index.cmp`[ _type: __mlir_type.i1, pred: __mlir_attr.`#index<cmp_predicate slt>` ](lhs, rhs) ``` * Improved diagnostic emission with ranges! Now errors highlight the whole section of code and not just the first character. ### Week of 2022-11-07 * Implemented the `@interface` and `@implements` decorators, which provide access to KGEN generator interfaces. A function marked as an `@interface` has no body, but it can be implemented by multiple other functions. ```mojo @interface def add(lhs: index, rhs: index): @implements(add) def normal_add(lhs: index, rhs: index) -> index: return lhs + rhs @implements(add) def slow_add(lhs: index, rhs: index) -> index: wait(1000) return normal_add(lhs, rhs) ``` * 📢 Support for static struct methods and initializer syntax has been added. Initializing a struct with `Foo()` calls an implicitly static `__new__` method. This method should be used instead of `__init__` inside structs. ```mojo struct Foo: var value: index def __new__() -> Foo: var result: Foo result.value = Foo.return_a_number() # static method! return result @staticmethod def return_a_number() -> index: return 42 ``` * 📢 Full by-ref argument support. It's now possible to define in-place operators like `__iadd__` and functions like `swap(x, y)` correctly. * 📢 Implemented support for field extraction from rvalues, like `x.value` where `x` is not an lvalue (a `var` declaration or by-ref function argument). ## October 2022 ### Week of 2022-10-31 * Revised `return` handling so that a return statement with no expression is syntax sugar for `return None`. This enables early exits in functions that implicitly return `None` to be cleaner: ```mojo def just_return(): return ``` * Added support for parsing more expressions: if-else, bitwise operators, shift operators, comparisons, floor division, remainder, and matmul. * 📢 The type of the `self` argument can now be omitted on member methods. ### Week of 2022-10-24 * Added parser support for right-associativity and unary ops, like the power operator `a ** b ** c` and the negation operator `-a`. * Add support for `&expr` in Mojo, which allows denoting a by-ref argument in functions. This is required because the `self` type of a struct method is implicitly a pointer. * Implemented support for parametric function declarations, such as: ```mojo struct SIMD[dt: DType, width: index]: fn struct_method(self: &SIMD[dt, width]): pass def fancy_add[dt: DType, width: index]( lhs: SIMD[dt, width], rhs: SIMD[dt, width]) -> index: return width ``` ### Week of 2022-10-17 * Added explicit variable declarations with `var`, for declaring variables both inside functions and structs, with support for type references. Added `index` as a temporary built-in type. ```mojo def foo(lhs: index, rhs: index) -> index: var result: index = lhs + rhs return result ``` * Implemented support for parsing struct declarations and references to type declarations in functions!
In `def`, the type can be omitted to signal an object type. ```mojo struct Foo: var member: index def bar(x: Foo, obj) -> index: return x.member ``` * Implemented parser support for `if` statements and `while` loops! ```mojo def if_stmt(c: index, a: index, b: index) -> index: var result: index = 0 if c: result = a else: result = b return result def while_stmt(init: index): while init > 1: init = init - 1 ``` * Significantly improved error emission and handling, allowing the parser to emit multiple errors while parsing a file. ### Week of 2022-10-10 * Added support for parsing integer, float, and string literals. * Implemented parser support for function input parameters and results. You can now write parametric functions like: ```mojo def foo[param: Int](arg: Int) -> Int: result = param + arg return result ``` ### Week of 2022-10-03 * Added some basic parser scaffolding and initial parser productions, including trivial expressions and assignment parser productions. * Implemented basic scope handling and function IR generation, with support for forward declarations. Simple functions like: ```mojo def foo(x: Int): ``` now parse! But all argument types are hard-coded to the MLIR `index` type. * Added IR emission for simple arithmetic expressions on builtin types, like `x + y`. ## September 2022 ### Week of 2022-09-26 * Mojo's first patch, which added a lexer, landed on Sep 27, 2022. * Settled on `[]` for Mojo generics instead of `<>`. Square brackets are consistent with Python generics and don't have the less-than ambiguity other languages have. --- ## Mojo🔥 FAQ We tried to anticipate your questions about Mojo on this page. If this page doesn't answer all your questions, also check out our [community channels](https://www.modular.com/community). ## Motivation ### Why did you build Mojo? We built Mojo to solve an internal challenge at Modular, and we are using it extensively in our systems such as our [MAX Platform](https://www.modular.com/max). As a result, we are extremely committed to its long-term success and are investing heavily in it. Our overall mission is to unify AI software and we can’t do that without a unified language that can scale across the AI infrastructure stack. Our current focus is to unify CPU+GPU programming with blazing-fast execution on the [MAX Platform](https://www.modular.com/max). That said, the north star is for Mojo to support the whole gamut of general-purpose programming over time. For a longer answer, read [Why Mojo](/mojo/why-mojo). ### Why is it called Mojo? Mojo means “a magical charm” or “magical powers.” We thought this was a fitting name for a language that brings magical powers to Python, including unlocking an innovative programming model for accelerators and other heterogeneous systems pervasive in AI today. ### Why does Mojo have the 🔥 file extension? We paired Mojo with the fire emoji 🔥 as a fun visual way to impart onto users that Mojo empowers them to get their Mojo on—to develop faster and more efficiently than ever before. We also believe that the world can handle a unicode extension at this point, but you can also just use the `.mojo` extension. :) ### What problems does Mojo solve that no other language can? Mojo combines the usability of Python with the systems programming features it’s missing. We are guided more by pragmatism than novelty, but Mojo’s use of [MLIR](https://mlir.llvm.org/) allows it to scale to new exotic hardware types and domains in a way that other languages haven’t demonstrated.
It also has caching and distributed compilation built into its core. We also believe Mojo has a good chance of unifying hybrid packages in the broader Python community. ### What kind of developers will benefit the most from Mojo? Mojo’s initial focus is to bring programmability back to AI, enabling AI developers to customize and get the most out of their hardware. As such, Mojo will primarily benefit researchers and other engineers looking to write high-performance AI operations. Over time, Mojo will become much more interesting to the general Python community as it grows to be a superset of Python. We hope this will help lift the vast Python library ecosystem and empower more traditional systems developers who use C, C++, Rust, etc. ### Why build upon Python? Effectively, all AI research and model development happens in Python today, and there’s a good reason for this! Python is a powerful high-level language with clean, simple syntax and a massive ecosystem of libraries. It’s also one of the world's [most popular programming languages](https://www.tiobe.com/tiobe-index/), and we want to help it become even better. At Modular, one of our core principles is meeting customers where they are—our goal is not to further fragment the AI landscape but to unify and simplify AI development workflows. ### Why not enhance CPython (the major Python implementation) instead? We’re thrilled to see a big push to improve [CPython](https://en.wikipedia.org/wiki/CPython) by the existing community, but our goals for Mojo (such as deploying onto GPUs and other accelerators) need a fundamentally different architecture and compiler approach underlying it. CPython is a significant part of our compatibility approach and powers our Python interoperability. ### Why not enhance another Python implementation (like Codon, PyPy, etc.)? Codon and PyPy aim to improve performance compared to CPython, but Mojo’s goals are much deeper than this. Our objective isn’t just to create “a faster Python,” but to enable a whole new layer of systems programming that includes direct access to accelerated hardware, as outlined in [Why Mojo](/mojo/why-mojo). Our technical implementation approach is also very different; for example, we are not relying on heroic compiler and JIT technologies to “devirtualize” Python. Furthermore, solving big challenges for the computing industry is hard and requires a fundamental rethinking of the compiler and runtime infrastructure. This drove us to build an entirely new approach and we’re willing to put in the time required to do it properly (see our blog post about [building a next-generation AI platform](https://www.modular.com/blog/the-case-for-a-next-generation-ai-developer-platform)), rather than tweaking an existing system that would only solve a small part of the problem. ### Why not make Julia better? We think [Julia](https://julialang.org/) is a great language and it has a wonderful community, but Mojo is completely different. While Julia and Mojo might share some goals and look similar as an easy-to-use and high-performance alternative to Python, we’re taking a completely different approach to building Mojo. Notably, Mojo is Python-first and doesn't require existing Python developers to learn a new syntax. Mojo also has a bunch of technical advancements compared to Julia, simply because Mojo is newer and we’ve been able to learn from Julia (and from Swift, Rust, C++ and many others that came before us).
For example, Mojo takes a different approach to memory ownership and memory management, it scales down to smaller envelopes, and is designed with AI and MLIR-first principles (though Mojo is not only for AI). That said, we also believe there’s plenty of room for many languages and this isn’t an OR proposition. If you use and love Julia, that's great! We’d love for you to try Mojo and if you find it useful, then that's great too. ## Functionality ### Where can I learn more about Mojo’s features? The best place to start is the [Mojo Manual](/mojo/manual). And if you want to see what features are coming in the future, take a look at [the roadmap](/mojo/roadmap). ### What does it mean that Mojo is designed for MLIR? [MLIR](https://mlir.llvm.org/) provides a flexible infrastructure for building compilers. It’s based upon layers of intermediate representations (IRs) that allow for progressive lowering of any code for any hardware, and it has been widely adopted by the hardware accelerator industry since [its first release](https://blog.google/technology/ai/mlir-accelerating-ai-open-source-infrastructure/). Although you can use MLIR to create a flexible and powerful compiler for any programming language, Mojo is the world’s first language to be built from the ground up with MLIR design principles. This means that Mojo not only offers high-performance compilation for heterogeneous hardware, but it also provides direct programming support for the MLIR intermediate representations. ### Is Mojo only for AI or can it be used for other stuff? Mojo's initial focus is to solve AI programmability challenges. See [here](https://github.com/modular/modular/tree/main/examples/custom_ops) for examples of how to write custom GPU operations. That being said, the goal is to grow Mojo into a general-purpose programming language. We use Mojo at Modular to develop AI algorithms, but you can use it for other things like HPC, data transformations, writing pre/post processing operations, and much more. For examples of how Mojo can be used for other general programming tasks, see our [Mojo examples](https://github.com/modular/modular/tree/main/examples/mojo). ### Is Mojo interpreted or compiled? Mojo is a compiled language. [`mojo build`](/mojo/cli/build) performs ahead-of-time (AOT) compilation to save an executable program. [`mojo run`](/mojo/cli/run) performs just-in-time (JIT) compilation to execute a Mojo source file without saving the compiled result. ### How does Mojo compare to Triton Lang? [Triton Lang](https://triton-lang.org/main/index.html) is a specialized programming model for one type of accelerator, whereas Mojo is a more general language that will support more architectures over time and includes a debugger, a full tool suite, etc. For more about embedded domain-specific languages (EDSLs) like Triton, read the “Embedded DSLs in Python” section of [Why Mojo](/mojo/why-mojo#embedded-dsls-in-python). ### How does Mojo help with PyTorch acceleration? We use Mojo as part of the overall Modular AI stack, [MAX](https://www.modular.com/max), which accelerates PyTorch models. Mojo is the language we use to write MAX’s high-performance CPU and GPU graph operations. ### Does Mojo support distributed execution? Not alone. You will need to leverage the [MAX Platform](https://www.modular.com/max) for that.
Mojo is one component of the Modular stack that makes it easier for you to author highly performant, portable CPU and GPU graph operations, but you’ll also need a runtime (or “OS”) that supports graph-level transformations and heterogeneous compute, which is provided by MAX. ### Will Mojo support web deployment (such as Wasm or WebGPU)? We haven’t prioritized this functionality yet, but there’s no reason Mojo can’t support it. ### How do I convert Python programs or libraries to Mojo? Mojo is still early and not yet a Python superset, so only simple programs can be brought over as-is with no code changes. We will continue investing in this and build migration tools as the language matures. ### What about interoperability with other languages like C/C++? Yes, we want to enable developers to port code from languages other than Python to Mojo as well. We expect that due to Mojo’s similarity to the C/C++ type systems, migrating code from C/C++ should work well and it’s in [our roadmap](/mojo/roadmap#cc-interop). ### How does Mojo support hardware lowering? Mojo leverages LLVM-level dialects for the hardware targets it supports, and it uses other MLIR-based code-generation backends where applicable. This also means that Mojo is easily extensible to any hardware backend. For more information, read about our vision for [pluggable hardware](https://www.modular.com/hardware). ### Who writes the software to add more hardware support for Mojo? Mojo provides all the language functionality necessary for anyone to extend hardware support. As such, we expect hardware vendors and community members will contribute additional hardware support in the future. ### How does Mojo provide a 35,000x speed-up over Python? Modern CPUs are surprisingly complex and diverse, but Mojo enables systems-level optimizations and flexibility that unlock the features of any device in a way that Python cannot. So the hardware matters for this sort of benchmark, and for the Mandelbrot benchmarks we show in our [launch keynote](https://www.youtube.com/watch?v=-3Kf2ZZU-dg&t=1543s), we ran them on an [AWS r7iz.metal-16xl](https://aws.amazon.com/ec2/instance-types/r7iz/) machine. For lots more information, check out our 3-part blog post series about [how Mojo gets a 35,000x speedup over Python](https://www.modular.com/blog/how-mojo-gets-a-35-000x-speedup-over-python-part-1). By the way, all the CPU and GPU graph operations that power Modular's [MAX Platform](https://www.modular.com/max) are written in Mojo. We also compared our matrix multiplication implementation to other state-of-the-art implementations that are usually written in assembly. To see the results, see [our blog post about unified matrix multiplication](https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication). ## Performance ### Are there any AI-related performance benchmarks for Mojo? It’s important to remember that Mojo is a general-purpose programming language, and any AI-related benchmarks will rely heavily upon other framework components. For example, our in-house CPU and GPU graph operations that power Modular's [MAX](https://www.modular.com/max) are all written in Mojo and you can learn more about performance in our [matrix multiplication blog post](https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication). For details about our end-to-end model performance, read about how we measure performance at Modular [here](https://www.modular.com/blog/max-gpu-state-of-the-art-throughput-on-a-new-genai-platform).
## Mojo SDK ### How can I get access to the SDK? Mojo is included with the `max` conda package. Try it now by following the tutorial to [get started with Mojo](/mojo/manual/get-started). Read more about [why Mojo is bundled with MAX](/max/faq#why-bundle-mojo-with-max). ### Is the Mojo Playground still available? Yes, but it's different. When we first announced Mojo, it was available only through login, in a JupyterLab environment. Now that Mojo is available for local development, we've shut down that service. The new [Mojo Playground](https://developer.modular.com/playground) does not require login. * It provides access to Mojo and the Mojo standard library. It does not have network access, so you can't install additional Mojo or Python packages. * It doesn't include any Python packages by default. In the future, we intend to make some common Python packages available to import in the Playground. * You can download your code or share it as a gist, but there's no mechanism for saving code in the Playground itself. Any changes will be lost when you switch code examples (as well as in the event of a server refresh or update). If you come up with something you want to save, download it or share it using buttons in the Playground toolbar. * There might be some bugs. Please [report issues and feedback on GitHub](https://github.com/modular/modular/issues/new/choose). ### What are the license terms for the SDK? Please read the [Terms of use](https://www.modular.com/legal/terms). ### What operating systems are supported? Currently, we support Ubuntu Linux 20.04/22.04 (64-bit x86) and macOS (Apple silicon). Support for Windows will follow. Until then, you have several options: * Windows users can use [Windows Subsystem for Linux version 2 (WSL 2)](https://learn.microsoft.com/en-us/windows/wsl/install) running a supported Linux distribution. * Intel Mac users can use a [Docker](https://www.docker.com/) container running a supported Linux distribution. * Users on any system can install the SDK on a remote machine running a supported Linux distribution. ### Is there IDE integration? Yes, we've published an official [Mojo language extension](https://marketplace.visualstudio.com/items?itemName=modular-mojotools.vscode-mojo) for VS Code. The extension supports various features including syntax highlighting, code completion, formatting, hover, etc. It works seamlessly with remote-ssh and dev containers to enable remote development in Mojo. ### Does the Mojo SDK collect telemetry? Yes, the Mojo SDK collects some basic system information, basic compiler/runtime events, and crash reports that enable us to identify, analyze, and prioritize Mojo issues. This telemetry is crucial to help us quickly identify problems and improve our products. Without this telemetry, we would have to rely on user-submitted bug reports, and in our decades of experience building developer products, we know that most people don’t do that. The telemetry provides us the insights we need to build better products for you. You can opt out of the crash report and compiler/runtime telemetry, but package install/update/uninstall events cannot be disabled (see the [MAX SDK terms](https://www.modular.com/legal/max)).
To disable crash reports, use this command:

```sh
modular config-set crash_reporting.enabled=false
```

To reduce other telemetry to only the required telemetry events, use this command:

```sh
modular config-set telemetry.level=0
```

There are 3 telemetry levels: `0` currently records nothing (unless you're also using MAX, which records hardware information and session durations); `1` records high-level events, such as when the compiler is invoked; and `2` records more detail, such as the time spent compiling.

## Versioning & compatibility

### What’s the Mojo versioning strategy?

Mojo is still in early development and not at a 1.0 version yet. It’s still missing many foundational features, but please take a look at our [roadmap](/mojo/roadmap) to understand where things are headed. As such, the language is evolving rapidly and source stability is not guaranteed.

### How often will you be releasing new versions of Mojo?

Mojo development is moving fast and we are regularly releasing updates. Please join the [Mojo Discord channel](http://discord.gg/modular) for notifications and [sign up for our newsletter](https://www.modular.com/modverse#signup) for more coarse-grained updates.

## Open Source

### Will Mojo be open-sourced?

We have committed to open-sourcing Mojo in 2026. Mojo is still young, so we will continue to incubate it within Modular until more of its internal architecture is fleshed out.

### Why not develop Mojo in the open from the beginning?

Mojo is a big project and has several architectural differences from previous languages. We believe a tight-knit group of engineers with a common vision can move faster than a community effort. This development approach is also well established in other projects that are now open source (such as LLVM, Clang, Swift, MLIR, etc.).

## Community

### Where can I ask more questions or share feedback?

If you have questions about upcoming features or have suggestions for the language, be sure you first read the [Mojo roadmap](/mojo/roadmap), which provides important information about our current priorities and links to our GitHub channels where you can report issues and discuss new features.

To get in touch with the Mojo team and developer community, use the resources on our [community page](https://www.modular.com/community).

---

## Mojo🔥 roadmap & sharp edges

This document captures the broad plan for how we intend to implement things in Mojo, and some early thoughts about key design decisions. This is not a full design spec for any of these features, but it can provide a "big picture" view of what to expect over time. It is also an acknowledgement of major missing components that we plan to add.

## Overall priorities

Mojo is still in early development and many language features will arrive in the coming months. We are highly focused on building Mojo the right way (for the long term), so we want to fully build out the core Mojo language features before we work on other dependent features and enhancements.

Currently, that means we are focused on the core system programming features that are essential to [Mojo's mission](/mojo/why-mojo), as outlined in the following sections of this roadmap.

In the near-term, we will **not** prioritize "general goodness" work such as:

* Adding syntactic sugar and short-hands for Python.
* Adding features from other languages that are missing from Python (such as public/private declarations).
* Tackling broad Python ecosystem challenges like packaging.
If you have encountered any bugs with current Mojo behavior, please [submit an issue on GitHub](https://github.com/modular/modular/issues).

If you have ideas about how to improve the core Mojo features, we prefer that you first look for similar topics or start a new conversation about it on [Discord](https://discord.gg/modular).

We also consider Mojo to be a new member of the Python family, so if you have suggestions to improve the experience with Python, we encourage you to propose these "general goodness" enhancements through the formal [PEP process](https://peps.python.org/pep-0001/).

### Why not add syntactic sugar or other minor new features?

We are frequently asked whether Mojo will add minor features that people love in other languages but that are missing in Python, such as "implicit return" at the end of a function, public/private access control, fixing Python packaging, and various syntactic shorthands. As mentioned above, we are intentionally *not* adding these kinds of features to Mojo right now. There are three major reasons for this:

* First, Mojo is still young: we are still "building a house" by laying down major bricks in the type system and adding system programming features that Python lacks. We know we need to implement support for many existing Python features (compatibility is a massive and important goal of Mojo) and this work is not done yet. We have limited engineering bandwidth and want to focus on building essential functionality, and we will not debate whether certain syntactic sugar is important or not.
* Second, syntactic sugar is like mortar in a building: its best use is to hold the building together by filling in usability gaps. Sugar (and mortar) is problematic to add early into a system: you can run into problems laying the next bricks because the sugar gets in the way. We have experience building other languages (such as Swift) that added sugar early; some of that sugar could have been subsumed by more general features if time and care had been given to broader evaluation.
* Third, the Python community should tackle some of these ideas first. It is important to us that Mojo be a good member of the Python family, not just a language with Pythonic syntax. As such, we don't want to needlessly diverge from Python evolution: adding a bunch of features could lead to problems down the road if Python makes incompatible decisions. Such a future would fracture the community, which would cause massively more harm than any minor language feature could offset.

For all these reasons, "nice to have" syntactic sugar is not a priority, and we will quickly close such proposals to avoid cluttering the issue tracker. If you'd like to propose a "general goodness" syntactic feature, please do so with the existing [Python PEP process](https://peps.python.org/pep-0000/). If/when Python adopts a feature, Mojo may also add it, because Mojo's goal is to adopt Python's syntax. We are happy with this approach because the Python community is better equipped to evaluate these features: they have mature code bases to evaluate them with, and they have processes and infrastructure for making structured language evolution decisions.

## Small independent features

There are a number of missing features that are important to round out the language fully, but which don't depend strongly on other features. These include things like:

* Improved package management support.
* Many standard library features, including copy-on-write data structures.
* Support for "top level code" at file scope.
* Algebraic data types like `enum` in Swift/Rust, and pattern matching.
* Many standard library types need refinement, including `Optional[T]` and `Result[T, Error]`.

## Ownership and Lifetimes

The ownership system is partially implemented, and is expected to get built out in the next couple of months. The basic support for ownership includes features like:

* Capture declarations in closures.
* Lifetime checker: complain about invalid mutable references.
* Lifetime checker: enforce argument exclusivity for mutable references.

Mojo has support for a safe `Pointer` type, and it is used in the standard library, but it is still under active development and not very pretty or nice to use right now.

## Traits support

Mojo has basic support for [traits](/mojo/manual/traits). Traits allow you to specify a set of requirements for types to implement. Types can implement those requirements to *conform to* the trait. Traits allow you to write generic functions and generic containers, which can work with any type that conforms to a given trait, instead of being hard-coded to work with a specific type.

Currently, the only kind of requirements supported by traits are required method signatures. The trait can't provide a default implementation for its required methods, so each conforming type must implement all of the required methods.

A number of [built-in traits](/mojo/manual/traits#built-in-traits) are already implemented in the standard library.

We plan to expand traits support in future releases. Planned features include:

* Support for default implementations of required methods.
* Support for a feature like Swift's extensions, allowing you to add a trait to a preexisting type.
* Improved support for conditional conformance.

## Classes

Mojo still doesn't support classes, the primary thing Python programmers use pervasively! This isn't because we hate dynamism - quite the opposite. It is because we need to get the core language semantics nailed down before adding them. We expect to provide full support for all the dynamic features in Python classes, and want the right framework to hang that off of. When we get there, we will discuss what the right default is: for example, is full Python hash-table dynamism the default? Or do we use a more efficient model by default (e.g. vtable-based dispatch and explicitly declared stored properties) and allow opting into dynamism with a `@dynamic` decorator on the class? More discussion is [in this proposal](https://github.com/modular/modular/blob/main/mojo/proposals/mojo-and-dynamism.md).

## C/C++ Interop

Integration to transparently import Clang C/C++ modules. Mojo's type system and C++'s are very compatible, so we should be able to have something pretty nice here. Mojo can leverage Clang to transparently generate a foreign function interface between C/C++ and Mojo, with the ability to directly import functions:

```mojo
from "math.h" import cos

print(cos(0))
```

## Calling Mojo from Python

Currently you can call Python code from Mojo, but not the reverse: you can't pass a Mojo callback to a Python function, or build a Python extension in Mojo. We want to support calling Mojo from Python, but we want to do it right and we need the core language to be more mature first.

## Full MLIR decorator reflection

All decorators in Mojo have hard-coded behavior in the parser. In time, we will move these decorators to being compile-time metaprograms that use MLIR integration. This may depend on C++ interop for talking to MLIR.
This completely opens up the compiler to programmers. Static decorators are functions executed at compile-time with the capability to inspect and modify the IR of functions and types.

```mojo
fn value(t: TypeSpec):
    t.__copyinit__ = # synthesize dunder copyinit automatically

@value
struct TrivialType: pass

fn full_unroll(loop: mlir.Operation):
    # unrolling of structured loop

fn main():
    @full_unroll
    for i in range(10):
        print(i)
```

## Sharp Edges

The entire Modular kernel library is written in Mojo, and its development has been prioritized based on the internal needs of those users. Given that Mojo is still a young language, there is a litany of small missing features that many Python and systems programmers may expect from their language, as well as features that don't yet work quite the way we want, sometimes in ways that can be surprising or unexpected. This section of the document describes a variety of "sharp edges" in Mojo, and potentially how to work around them if needed. We expect all of these to be resolved in time, but in the meantime, they are documented here.

### No list or dict comprehensions

Mojo does not yet support Python list or dictionary comprehension expressions, like `[x for x in range(10)]`.

### No `lambda` syntax

Mojo does not yet support defining anonymous functions with the `lambda` keyword.

### Parametric aliases

Mojo aliases can refer to parametric values but cannot themselves have parameter lists. As of v0.6.0, you can create a parametric alias by aliasing an unbound or partially-bound type. For example, the new `Scalar` type is defined as:

```mojo
alias Scalar = SIMD[size=1]
```

This creates a parametric alias that you can use like this:

```mojo
var i = Scalar[DType.int8]
```

Parametric aliases with an explicit parameter list aren't yet supported:

```mojo
alias mul2[x: Int] = x * 2  # Error!
```

### `Exception` is actually called `Error`

In Python, programmers expect that exceptions all subclass the `Exception` builtin class. The only available type for Mojo "exceptions" is `Error`:

```mojo
fn raise_an_error() raises:
    raise Error("I'm an error!")
```

The reason we call this type `Error` instead of `Exception` is that it's not really an exception: raising an error does not cause stack unwinding, and, most importantly, it does not carry a stack trace. And without polymorphism, the `Error` type is the only kind of error that can be raised in Mojo right now.

### No Python-style generator functions

Mojo does not yet support Python-style generator functions (`yield` syntax). These are "synchronous co-routines": functions with multiple suspend points.

### No `async for` or `async with`

Although Mojo has support for async functions with `async fn` and `async def`, Mojo does not yet support the `async for` and `async with` statements.

### Scoping and mutability of statement variables

Python programmers understand that local variables are implicitly declared and scoped at the function level. As the Mojo Manual explains, this is supported in Mojo for [implicitly-declared variables](/mojo/manual/variables#implicitly-declared-variables). However, there are some nuances to Python's implicit declaration rules that Mojo does not match 1-to-1. For example, the scope of `for` loop iteration variables and caught exceptions in `except` statements is limited to the next indentation block, for both `def` and `fn` functions.
Python programmers will expect the following program to print "2":

```python
for i in range(3):
    pass
print(i)
```

However, Mojo will complain that `print(i)` is a use of an unknown declaration. This is because whether `i` is defined at this line is dynamic in Python. For instance, the following Python program will fail:

```python
for i in range(0):
    pass
print(i)
```

with `NameError: name 'i' is not defined`, because the definition of `i` is a dynamic characteristic of the function. Mojo's lifetime tracker is intentionally simple (so lifetimes are easy to use!), and cannot reason that `i` would be defined even when the loop bounds are constant.

### Name scoping of nested function declarations

In Python, nested function declarations produce dynamic values. They are essentially syntactic sugar for `bar = lambda ...`.

```python
def foo():
    def bar():  # creates a function bound to the dynamic value 'bar'
        pass
    bar()  # indirect call
```

In Mojo, nested function declarations are static, so calls to them are direct unless made otherwise.

```mojo
fn foo():
    fn bar():  # static function definition bound to 'bar'
        pass
    bar()  # direct call
    var f = bar  # materialize 'bar' as a dynamic value
    f()  # indirect call
```

Currently, this means you cannot declare two nested functions with the same name. For instance, the following example does not work in Mojo:

```mojo
def pick_func(cond) -> def() capturing:
    if cond:
        def bar(): return 42
    else:
        def bar(): return 3  # error: redeclaration of 'bar'
    return bar
```

The functions in each conditional must be explicitly materialized as dynamic values.

```mojo
def pick_func(cond) -> def() capturing:
    var result: def() capturing  # Mojo function type
    if cond:
        def bar0(): return 42
        result = bar0
    else:
        def bar1(): return 3
        result = bar1
    return result
```

We hope to sort out these oddities with nested function naming as our model of closures in Mojo develops further.

### Limited polymorphism

Mojo has implemented static polymorphism through traits, as noted above. We plan to implement dynamic polymorphism through classes and MLIR reflection in the future.

Python programmers are used to implementing special dunder methods on their classes to interface with generic methods like `print()` and `len()`. For instance, one expects that implementing `__repr__()` or `__str__()` on a class will enable that class to be printed using `print()`.

```python
class One:
    def __init__(self): pass
    def __repr__(self): return '1'

print(One())  # prints '1'
```

Mojo currently supports similar functionality through the [`Writable`](/mojo/stdlib/utils/write/Writable) trait, so that `print()` works on all `Writable` types. We'll continue to add traits support to the standard library to enable common use cases like this.

### The standard library has limited exceptions use

For historical and performance reasons, core standard library types typically do not use exceptions. For instance, `List` does not raise on an out-of-bounds access (it will crash), and `Int` does not throw on divide by zero. In other words, most standard library types are considered "unsafe".

```mojo
var l = List[Int](capacity=0)
print(l[1])  # could crash or print garbage values (undefined behavior)

print(1//0)  # does not raise and could print anything (undefined behavior)
```

This is clearly unacceptable given the strong memory safety goals of Mojo. We will circle back to this when more language features and language-level optimizations are available.
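Until safer APIs land in the standard library, one pragmatic interim option is to wrap unsafe accesses in your own checked helpers that participate in Mojo's `raises` machinery. Here's a minimal sketch (the `get_checked` helper and its error message are illustrative, not standard library APIs):

```mojo
fn get_checked(values: List[Int], i: Int) raises -> Int:
    # Validate the index before using the unchecked subscript.
    if i < 0 or i >= len(values):
        raise Error("index out of range")
    return values[i]

fn main():
    var l = List[Int](1, 2, 3)
    try:
        _ = get_checked(l, 5)
    except e:
        print(e)  # index out of range
```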
### Nested functions cannot be recursive

Nested functions (any function that is not a top-level function) cannot be recursive in any way. Nested functions are considered "parameters", and although parameter values do not have to obey lexical order, their uses and definitions cannot form a cycle, so a nested function cannot refer to itself:

```mojo
fn try_recursion():
    fn bar(x: Int):  # error: circular reference :<
        if x < 10:
            bar(x + 1)
```

### The `or` expression is statically typed

In Python, the result type of `(a or b)` is dynamic: if `a` is truthy, the expression evaluates to `a`, whatever its type. For example, given an integer value `i` and a string value `s`:

```python
i = 5
print(type(i or s))  # prints <class 'int'>
```

In Mojo, given the expression `(a or b)`, the compiler needs to statically determine a result type that the types of `a` and `b` can both be **converted** to. For example, currently an `Int` can be implicitly converted to a `String`, but a `String` can't be implicitly converted to an `Int`. So given an integer value `i` and a string value `s`, the value of `(i or s)` will *always* be a `String`.

### `StringLiteral` behaves differently than `String`

String literals behave differently than `String` values in Mojo code. For example:

```mojo
fn main():
    var g: Int = 0
    var h: String = "hello"
    print(g or h)  # prints `hello`
    print(g or "hello")  # prints `True`
```

While the `IntLiteral` and `FloatLiteral` types convert or *materialize* at runtime into `Int` and `Float64` values, respectively, string literals continue to exist at runtime as `StringLiteral` values. This can result in surprising behavior because `StringLiteral` has a more restricted API than `String`. In the example above, because the `or` expression is statically typed, and `Int` cannot be implicitly converted to a `StringLiteral`, the compiler chooses a result type that both `Int` and `StringLiteral` can be converted to: in this case, `Bool`.

We plan to address this issue in the future, but in the near term, you can avoid the inconsistency between `StringLiteral` and `String` by explicitly converting string literals to `String` values. For example:

```mojo
var h: String = "hello"
# or
print(g or String("hello"))
```

### Walrus assignment expression limitations

The Mojo compiler reports an uninitialized value error if an expression uses multiple "walrus" [assignment expressions](/mojo/manual/operators#assignment-expressions) to declare more than one variable. For example:

```mojo
def A() -> Int:
    return 42

def B() -> String:
    return "waffles"

def main():
    if (a := A()) and (b := B()):
        print("a =", a)
        print("b =", b)
```

```output
walrus-conditional.mojo:8:14: error: use of uninitialized value 'b'
        print("b =", b)
                     ^
walrus-conditional.mojo:6:24: note: 'b' declared here
    if (a := A()) and (b := B()):
                       ^
```

Ideally, the Mojo compiler should compile this code because the second `print()` statement executes only if a value is assigned to `b`. To work around this limitation you can explicitly initialize `b` before the `if` statement, like this:

```mojo
def A() -> Int:
    return 42

def B() -> String:
    return "waffles"

def main():
    b = String()
    if (a := A()) and (b := B()):
        print("a =", a)
        print("b =", b)
```

```output
a = 42
b = waffles
```

---

## monotonic

`monotonic() -> UInt`

Returns the current monotonic time in nanoseconds. This function queries the current platform's monotonic clock, making it useful for measuring time differences, but the significance of the returned value varies depending on the underlying implementation.

**Returns:**

The current time in ns.

---

## Movable

The Movable trait denotes a type whose value can be moved.
Implement the `Movable` trait on `Foo`, which requires the `__moveinit__` method:

```mojo
struct Foo(Movable):
    fn __init__(out self):
        pass

    fn __moveinit__(out self, owned existing: Self):
        print("moving")
```

You can now use the `^` suffix to move the object instead of copying it inside generic functions:

```mojo
fn return_foo[T: Movable](owned foo: T) -> T:
    return foo^

var foo = Foo()
var res = return_foo(foo^)
```

```plaintext
moving
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__moveinit__`

`__moveinit__(out self: _Self, owned existing: _Self, /)`

Create a new instance of the value by moving the value of another.

**Args:**

* ​existing (`_Self`): The value to move.

---

## mul

`mul(lhs: IntTuple[origin], rhs: Int) -> IntTuple`

Multiply each element in an `IntTuple` by a scalar value. This function creates a new `IntTuple` where each element (at any nesting level) is multiplied by the provided integer value.

**Args:**

* ​lhs (`IntTuple[origin]`): The `IntTuple` whose elements will be multiplied.
* ​rhs (`Int`): The scalar integer to multiply each element by.

**Returns:**

A new `IntTuple` with the same structure as the input but with all elements multiplied by the scalar value.

---

## mul

`mul(x: SIMD[dtype, size], y: SIMD[dtype, size]) -> SIMD[dtype, size]`

---

## mulhi

`mulhi(a: SIMD[uint16, 1], b: SIMD[uint16, 1]) -> SIMD[uint32, 1]`

Calculates the most significant 32 bits of the product of two 16-bit unsigned integers. Multiplies two 16-bit unsigned integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection.

Note: On NVIDIA GPUs, this maps directly to the MULHI.U16 PTX instruction. On others, it performs multiplication using 32-bit arithmetic.

**Args:**

* ​a (`SIMD[uint16, 1]`): First 16-bit unsigned integer operand.
* ​b (`SIMD[uint16, 1]`): Second 16-bit unsigned integer operand.

**Returns:**

The high 32 bits of the product a \* b.

`mulhi(a: SIMD[int16, 1], b: SIMD[int16, 1]) -> SIMD[int32, 1]`

Calculates the most significant 32 bits of the product of two 16-bit signed integers. Multiplies two 16-bit signed integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection.

Note: On NVIDIA GPUs, this maps directly to the MULHI.S16 PTX instruction. On others, it performs multiplication using 32-bit arithmetic.

**Args:**

* ​a (`SIMD[int16, 1]`): First 16-bit signed integer operand.
* ​b (`SIMD[int16, 1]`): Second 16-bit signed integer operand.

**Returns:**

The high 32 bits of the product a \* b.

`mulhi(a: SIMD[uint32, 1], b: SIMD[uint32, 1]) -> SIMD[uint32, 1]`

Calculates the most significant 32 bits of the product of two 32-bit unsigned integers. Multiplies two 32-bit unsigned integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection.

Note: On NVIDIA GPUs, this maps directly to the MULHI.U32 PTX instruction. On others, it performs multiplication using 64-bit arithmetic.

**Args:**

* ​a (`SIMD[uint32, 1]`): First 32-bit unsigned integer operand.
* ​b (`SIMD[uint32, 1]`): Second 32-bit unsigned integer operand.

**Returns:**

The high 32 bits of the product a \* b.

`mulhi(a: SIMD[int32, 1], b: SIMD[int32, 1]) -> SIMD[int32, 1]`

Calculates the most significant 32 bits of the product of two 32-bit signed integers. Multiplies two 32-bit signed integers and returns the high 32 bits of their product. Useful for fixed-point arithmetic and overflow detection.
Note: On NVIDIA GPUs, this maps directly to the MULHI.S32 PTX instruction. On others, it performs multiplication using 64-bit arithmetic.

**Args:**

* ​a (`SIMD[int32, 1]`): First 32-bit signed integer operand.
* ​b (`SIMD[int32, 1]`): Second 32-bit signed integer operand.

**Returns:**

The high 32 bits of the product a \* b.

---

## multimem_ld_reduce

`multimem_ld_reduce[type: DType, *, count: Int, reduction: ReduceOp, scope: Scope, consistency: Consistency, accum_type: DType = get_accum_type[::DType,::DType](), output_width: Int = 1](addr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]) -> StaticTuple[SIMD[accum_type, output_width], count]`

Performs a vectorized load-reduce operation using NVIDIA's multimem feature. This function loads multiple values from global memory and performs a reduction operation across them in a single instruction. It utilizes NVIDIA's multimem feature available on SM90+ GPUs for improved performance.

**Constraints:**

* Only supported on SM90+ GPUs.
* Count must be 2 or 4.
* Type must be float32, float16, or bfloat16.

**Parameters:**

* ​type (`DType`): Data type for the operation (float32, float16, or bfloat16).
* ​count (`Int`): Number of elements to load and reduce (2 or 4).
* ​reduction (`ReduceOp`): Type of reduction operation to perform.
* ​scope (`Scope`): Memory scope for the operation.
* ​consistency (`Consistency`): Memory consistency model to use.
* ​accum\_type (`DType`): Data type used for accumulation. Defaults to a wider type than the input (e.g. float32 for float16 inputs) to maintain precision during reduction.
* ​output\_width (`Int`): Width of each output SIMD vector (default 1).

**Args:**

* ​addr (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Pointer to global memory where data will be loaded from.

**Returns:**

A StaticTuple containing 'count' SIMD vectors of width 'output\_width' holding the results of the load-reduce operation.

---

## multimem_st

`multimem_st[type: DType, *, count: Int, scope: Scope, consistency: Consistency, width: Int = 1](addr: UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)], values: StaticTuple[SIMD[type, width], count])`

Stages an inline multimem.st instruction. This operation performs a store to all memory locations pointed to by the multimem address using the specified memory consistency model and scope.

Notes:

* Requires SM90+ GPU architecture (PTX ISA 8.1+).
* The address must be a valid multimem address.
* Supported type-width combinations must total 32/64/128 bits.
* Default memory semantics: weak consistency (when not specified).
* Vector stores (.v2/.v4) require matching total size constraints.

Example:

```mojo
from gpu.memory import *

# Store 2 float32 values to a multimem address. The `values` argument is a
# StaticTuple of SIMD elements, matching the signature above.
multimem_st[DType.float32, count=2, scope=Scope.CTA, consistency=Consistency.RELAXED](
    addr, StaticTuple[SIMD[DType.float32, 1], 2](val1, val2)
)

# Vector store of 4 float16x2 values.
multimem_st[DType.float16, count=4, scope=Scope.CLUSTER, consistency=Consistency.RELEASE, width=2](
    addr, StaticTuple[SIMD[DType.float16, 2], 4](vec1, vec2, vec3, vec4)
)
```

See Also: [PTX ISA Documentation](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-multimem-ld-reduce-multimem-st-multimem-red).

**Parameters:**

* ​type (`DType`): The data type of elements to store (must be float16, bfloat16, or float32).
* ​count (`Int`): Number of vector elements per store operation (2 or 4).
* ​scope (`Scope`): Memory scope for visibility of the store operation (CTA/Cluster/GPU/System).
* ​consistency (`Consistency`): Memory consistency semantics (weak/relaxed/release). * ​width (`Int`): Vector width modifier for packed data types (default 1). **Args:** * ​addr (`UnsafePointer[SIMD[type, 1], address_space=AddressSpace(1)]`): Multimem address in global address space pointing to multiple locations. * ​values (`StaticTuple[SIMD[type, width], count]`): Packed SIMD values to store, with count matching the template parameter. --- ## multistage_dual_gemm `multistage_dual_gemm[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_type: DType, b_layout: Layout, //, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, origin], a: LayoutTensor[a_type, a_layout, origin], b0: LayoutTensor[b_type, b_layout, origin], b1: LayoutTensor[b_type, b_layout, origin], ctx: DeviceContext)` `multistage_dual_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1] = swilu[::DType,::Int], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), num_k_partitions: Int = 1](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b0: NDBuffer[b_type, 2, origin, b_shape], b1: NDBuffer[b_type, 2, origin, b_shape], ctx: DeviceContext)` --- ## multistage_dual_gemm_kernel `multistage_dual_gemm_kernel[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_type: DType, b_layout: Layout, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], binary_lambda_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) -> SIMD[$0, $1], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], a: LayoutTensor[a_type, a_layout, MutableAnyOrigin], b0: LayoutTensor[b_type, b_layout, MutableAnyOrigin], b1: LayoutTensor[b_type, b_layout, MutableAnyOrigin])` --- ## multistage_dual_mma `multistage_dual_mma[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, a_smem_layout: Layout, b_type: DType, b_layout: Layout, b_smem_layout: Layout, //, BM: Int, BN: Int, BK: Int, WM: Int, WN: Int, num_threads: Int, num_pipeline_stages: Int, transpose_b: Bool, /, *, swizzle_a: Bool = True, static_num_iters: Dim = Dim(-31337), k_group_size: UInt = UInt(1)](c0: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c1: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], a_iter_arg: LayoutTensorIter[type, a_layout, MutableAnyOrigin, address_space=address_space, 
alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b0_iter_arg: LayoutTensorIter[b_type, b_layout, MutableAnyOrigin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b1_iter_arg: LayoutTensorIter[b_type, b_layout, MutableAnyOrigin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], a_smem_iter_arg: LayoutTensorIter[a_type, a_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b0_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b1_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], num_iters: Int, /, *, num_b_rows: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## multistage_gemm `multistage_gemm[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, config: MatmulConfig[a_type, b_type, c_type, transpose_b], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), serial_reduction: Bool = False](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b: NDBuffer[b_type, 2, origin, b_shape], runtime_config: MatmulConfig[a_type, b_type, c_type, transpose_b], ctx: DeviceContext)` --- ## multistage_gemm_q `multistage_gemm_q[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, group_size: Int, pack_factor: Int, config: MatmulConfig[a_type, b_type, c_type, True], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, c_shape], a: NDBuffer[a_type, 2, origin, a_shape], b: NDBuffer[b_type, 2, origin, b_shape], runtime_config: MatmulConfig[a_type, b_type, c_type, True], ctx: DeviceContext)` --- ## multistage_mma_q `multistage_mma_q[BM: Int, BN: Int, BK: Int, WM: Int, WN: Int, num_threads: Int, num_pipeline_stages: Int, transpose_b: Bool, group_size: Int, pack_factor: Int, c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, a_smem_layout: Layout, b_type: DType, b_layout: Layout, b_smem_layout: Layout, scales_type: DType, scales_layout: Layout, scales_smem_layout: Layout, /, *, swizzle_a: Bool = True, static_num_iters: Dim = Dim(-31337), prefetch_init: Bool = True, continue_prefetch_b: Bool = False, transpose_b_next: Bool = False, b_next_gmem_layout: Layout = Layout(), b_next_smem_layout: Layout = Layout(), next_op_b_iter_alignment: Int = alignof[::DType,__mlir_type.!kgen.target]()](c: LayoutTensor[c_type, c_layout, origin, address_space=AddressSpace(5)], a_iter_arg: LayoutTensorIter[type, a_layout, origin, 
address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_iter_arg: LayoutTensorIter[b_type, b_layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], a_smem_iter_arg: LayoutTensorIter[a_type, a_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], mut b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], scales_smem_iter_arg: LayoutTensorIter[scales_type, scales_smem_layout, origin, address_space=AddressSpace(3), alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], scales_iter_arg: LayoutTensorIter[scales_type, scales_layout, origin, address_space=address_space, alignment=alignment, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], num_iters: Int, /, *, num_b_rows: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))` --- ## multistage_qgemm_kernel `multistage_qgemm_kernel[c_type: DType, c_layout: Layout, a_type: DType, a_layout: Layout, b_packed_type: DType, b_layout: Layout, group_size: Int, pack_factor: Int, transpose_b: Bool, config: MatmulConfig[a_type, b_packed_type, c_type, transpose_b], elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], a: LayoutTensor[a_type, a_layout, MutableAnyOrigin], b_packed: LayoutTensor[b_packed_type, b_layout, MutableAnyOrigin])` --- ## mulwide `mulwide(a: SIMD[uint32, 1], b: SIMD[uint32, 1]) -> SIMD[uint64, 1]` Performs a wide multiplication of two 32-bit unsigned integers. Multiplies two 32-bit unsigned integers and returns the full 64-bit result. Useful when the product may exceed 32 bits. Note: On NVIDIA GPUs, this maps directly to the MUL.WIDE.U32 PTX instruction. On others, it performs multiplication using 64-bit casts. **Args:** * ​a (`SIMD[uint32, 1]`): First 32-bit unsigned integer operand. * ​b (`SIMD[uint32, 1]`): Second 32-bit unsigned integer operand. **Returns:** The full 64-bit product of a \* b `mulwide(a: SIMD[int32, 1], b: SIMD[int32, 1]) -> SIMD[int64, 1]` Performs a wide multiplication of two 32-bit signed integers. Multiplies two 32-bit signed integers and returns the full 64-bit result. Useful when the product may exceed 32 bits or be negative. Note: On NVIDIA GPUs, this maps directly to the MUL.WIDE.S32 PTX instruction. On others, it performs multiplication using 64-bit casts. **Args:** * ​a (`SIMD[int32, 1]`): First 32-bit signed integer operand. * ​b (`SIMD[int32, 1]`): Second 32-bit signed integer operand. 
**Returns:** The full 64-bit signed product of a \* b --- ## naive_gemv `naive_gemv[c_size: Dim, a_shape: DimList, b_size: Dim, type: DType](c_buf: NDBuffer[type, 1, origin, __init__[::Intable](c_size)], a_buf: NDBuffer[type, 2, origin, a_shape], b_buf: NDBuffer[type, 1, origin, __init__[::Intable](b_size)])` --- ## naive_grouped_matmul `naive_grouped_matmul[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool = True](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin], max_num_tokens_per_expert: Int, num_active_experts: Int, ctx: DeviceContext)` --- ## naive_grouped_matmul_kernel `naive_grouped_matmul_kernel[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList](c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b: NDBuffer[b_type, 3, MutableAnyOrigin, b_shape], a_offsets: NDBuffer[uint32, 1, MutableAnyOrigin], expert_ids: NDBuffer[uint32, 1, MutableAnyOrigin])` --- ## Naive2dConvolution `struct Naive2dConvolution[output_type: DType, input_type: DType, filter_type: DType]` Struct wrapper for naive 2d convolution implementation. ## Fields * ​output (`UnsafePointer[SIMD[output_type, 1]]`): * ​input (`UnsafePointer[SIMD[input_type, 1]]`): * ​filter (`UnsafePointer[SIMD[filter_type, 1]]`): * ​pad\_d (`IndexList[2]`): * ​pad\_h (`IndexList[2]`): * ​pad\_w (`IndexList[2]`): * ​stride (`IndexList[3]`): * ​dilation (`IndexList[3]`): * ​num\_groups (`Int`): * ​output\_shape (`IndexList[5]`): * ​input\_shape (`IndexList[5]`): * ​filter\_shape (`IndexList[5]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, output: UnsafePointer[SIMD[output_type, 1]], input: UnsafePointer[SIMD[input_type, 1]], filter: UnsafePointer[SIMD[filter_type, 1]], output_shape: IndexList[5], input_shape: IndexList[5], filter_shape: IndexList[5], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[3], dilation: IndexList[3], num_groups: Int)` ### `run` `static run(output: UnsafePointer[SIMD[output_type, 1]], input: UnsafePointer[SIMD[input_type, 1]], filter: UnsafePointer[SIMD[filter_type, 1]], output_shape: IndexList[5], input_shape: IndexList[5], filter_shape: IndexList[5], pad_d: IndexList[2], pad_h: IndexList[2], pad_w: IndexList[2], stride: IndexList[3], dilation: IndexList[3], num_groups: Int)` --- ## named_barrier `named_barrier[num_threads: SIMD[int32, 1], id: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0)]()` Performs a named synchronization barrier at the block level. This function creates a synchronization point using a specific barrier ID, allowing for multiple independent barriers within a thread block. All threads in the block must execute this function with the same barrier ID and thread count before any thread can proceed past the barrier. Notes: * Only supported on NVIDIA GPUs. * Maps directly to the `nvvm.barrier` instruction. * Useful for fine-grained synchronization when different subsets of threads need to synchronize independently. * The barrier ID must not exceed 16. * All threads participating in the barrier must specify the same num\_threads value. 
**Parameters:**

* ​num\_threads (`SIMD[int32, 1]`): The number of threads that must reach the barrier before any can proceed.
* ​id (`SIMD[int32, 1]`): The barrier identifier (0-16). Default is 0.

---

## NamedTemporaryFile

`struct NamedTemporaryFile`

A handle to a temporary file.

Example:

```mojo
from tempfile import NamedTemporaryFile
from pathlib import Path

def main():
    var p: Path
    with NamedTemporaryFile(mode="rw") as f:
        p = f.name
        f.write("Hello world!")
        f.seek(0)
        print(f.read() == "Hello world!")
    print(String(p), p.exists())  # removed by default
```

Note: `NamedTemporaryFile.__init__` documents the arguments.

## Fields

* ​name (`String`): Name of the file.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, mode: String = __init__[__mlir_type.!kgen.string]("w"), name: Optional[String] = Optional(None), suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None), delete: Bool = True)`

Create a named temporary file. This is a wrapper around a `FileHandle`; `os.remove()` is called in the `close()` method if `delete` is True. Can be used as a context manager; when used as a context manager, `close()` is called when the context manager exits.

**Args:**

* ​mode (`String`): The mode to open the file in (the mode can be "r" or "w").
* ​name (`Optional[String]`): The name of the temp file. If it is unspecified, then a random name will be provided.
* ​suffix (`String`): Suffix to use for the file name if name is not provided.
* ​prefix (`String`): Prefix to use for the file name if name is not provided.
* ​dir (`Optional[String]`): Directory in which the file will be created.
* ​delete (`Bool`): Whether the file is deleted on close.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Move constructor for the file handle.

**Args:**

* ​existing (`Self`): The existing file handle.

### `__del__`

`__del__(owned self)`

Closes the file handle.

### `close`

`close(mut self)`

Closes the file handle.

### `read`

`read(self, size: Int = -1) -> String`

Reads the data from the file.

**Args:**

* ​size (`Int`): Requested number of bytes to read.

**Returns:**

The contents of the file.

### `read_bytes`

`read_bytes(self, size: Int = -1) -> List[SIMD[uint8, 1]]`

Read from the file buffer until we have `size` characters or we hit EOF. If `size` is negative or omitted, read until EOF.

**Args:**

* ​size (`Int`): Requested number of bytes to read.

**Returns:**

The contents of the file.

### `seek`

`seek(self, offset: SIMD[uint64, 1], whence: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[uint64, 1]`

Seeks to the given offset in the file.

**Args:**

* ​offset (`SIMD[uint64, 1]`): The byte offset to seek to from the start of the file.
* ​whence (`SIMD[uint8, 1]`): The reference point for the offset: os.SEEK\_SET = 0: start of file (default). os.SEEK\_CUR = 1: current position. os.SEEK\_END = 2: end of file.

**Returns:**

The resulting byte offset from the start of the file.

**Raises:**

An error if this file handle is invalid, or if file seek returned a failure.

### `write`

`write[*Ts: Writable](mut self, *args: *Ts)`

Write a sequence of Writable arguments to the provided Writer.

**Parameters:**

* ​\*Ts (`Writable`): Types of the provided argument sequence.

**Args:**

* ​\*args (`*Ts`): Sequence of arguments to write to this Writer.

### `write_bytes`

`write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])`

Write a span of bytes to the file.
**Args:**

* ​bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this file.

### `__enter__`

`__enter__(owned self) -> Self`

The function to call when entering the context.

**Returns:**

The file handle.

---

## nan

`nan[dtype: DType]() -> SIMD[dtype, 1]`

Gets a NaN value for the given dtype.

**Constraints:**

Can only be used for FP dtypes.

**Parameters:**

* ​dtype (`DType`): The value dtype.

**Returns:**

The NaN value of the given dtype.

---

## NDBuffer

`@register_passable(trivial)`

`struct NDBuffer[mut: Bool, //, type: DType, rank: Int, origin: Origin[mut], shape: DimList = create_unknown[::Int](), strides: DimList = create_unknown[::Int](), *, alignment: Int = 1, address_space: AddressSpace = AddressSpace(0), exclusive: Bool = True]`

An N-dimensional buffer. NDBuffer can be parametrized on rank, static dimensions and Dtype. It does not own its underlying pointer.

## Parameters

* ​mut (`Bool`): The inferred mutability.
* ​type (`DType`): The element type of the buffer.
* ​rank (`Int`): The rank of the buffer.
* ​origin (`Origin[mut]`): The origin of the memory being addressed.
* ​shape (`DimList`): The static size (if known) of the buffer.
* ​strides (`DimList`): The strides (if known) of the buffer.
* ​alignment (`Int`): The preferred address alignment of the buffer.
* ​address\_space (`AddressSpace`): The address space of the buffer.
* ​exclusive (`Bool`): The underlying memory allocation of the tensor is known only to be accessible through this pointer.

## Fields

* ​data (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): The underlying data for the buffer. The pointer is not owned by the NDBuffer.
* ​dynamic\_shape (`IndexList[rank, element_type=uint64]`): The dynamic value of the shape.
* ​dynamic\_stride (`IndexList[rank, element_type=uint64]`): The dynamic stride of the buffer.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__() -> Self`

Default initializer for NDBuffer. By default the fields are all initialized to 0.

`@implicit`

`__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self`

Constructs an NDBuffer with statically known rank, shapes and type.

**Constraints:**

The rank, shapes, and type are known.

**Args:**

* ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the data.

`@implicit`

`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space, alignment=alignment]) -> Self`

Constructs an NDBuffer with statically known rank, shapes and type.

**Constraints:**

The rank, shapes, and type are known.

**Args:**

* ​span (`Span[SIMD[type, 1], origin, address_space=address_space, alignment=alignment]`): Span of the data.

`@implicit`

`__init__(other: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Self`

Converts NDBuffers between different variants which do not affect the underlying memory representation. For example, this allows implicit conversion from `NDBuffer[type, rank, DimList(1, 2, 3), DimList(6, 6, 1), alignment=16]` to `NDBuffer[type, rank, DimList(1, 2, 3), DimList.create_unknown[rank](), alignment=4]`.

**Args:**

* ​other (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The other NDBuffer type.
`__init__(ptr: UnsafePointer[Scalar[type], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self`

Constructs an NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​ptr (`UnsafePointer[Scalar[type], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data.
* ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes.

`__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self`

Constructs an NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data.
* ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes.

`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: IndexList[rank, element_type=element_type]) -> Self`

Constructs an NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data.
* ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes.

`__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: DimList) -> Self`

Constructs an NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data.
* ​dynamic\_shape (`DimList`): A static tuple of size 'rank' representing shapes.

`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: DimList) -> Self`

Constructs an NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data.
* ​dynamic\_shape (`DimList`): A static tuple of size 'rank' representing shapes.

`__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: IndexList[rank, element_type=element_type], dynamic_stride: IndexList[rank, element_type=element_type]) -> Self`

Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data.
* ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes.
* ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides.

`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: IndexList[rank, element_type=element_type], dynamic_stride: IndexList[rank, element_type=element_type]) -> Self`

Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span over the data.
* ​dynamic\_shape (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing shapes.
* ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides.

`__init__(ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin], dynamic_shape: DimList, dynamic_stride: IndexList[rank, element_type=element_type]) -> Self`

Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, mut=mut, origin=origin]`): Pointer to the data.
* ​dynamic\_shape (`DimList`): A DimList of size 'rank' representing shapes.
* ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides.

`__init__(span: Span[SIMD[type, 1], origin, address_space=address_space], dynamic_shape: DimList, dynamic_stride: IndexList[rank, element_type=element_type]) -> Self`

Constructs a strided NDBuffer with statically known rank, but dynamic shapes and type.

**Constraints:**

The rank is known.

**Args:**

* ​span (`Span[SIMD[type, 1], origin, address_space=address_space]`): Span of the data.
* ​dynamic\_shape (`DimList`): A DimList of size 'rank' representing shapes.
* ​dynamic\_stride (`IndexList[rank, element_type=element_type]`): A static tuple of size 'rank' representing strides.

### `__getitem__`

`__getitem__(self, *idx: Int) -> SIMD[type, 1]`

Gets an element from the buffer from the specified index.

**Args:**

* ​\*idx (`Int`): Index of the element to retrieve.

**Returns:**

The value of the element.

`__getitem__(self, idx: IndexList[rank, element_type=element_type]) -> SIMD[type, 1]`

Gets an element from the buffer from the specified index.

**Args:**

* ​idx (`IndexList[rank, element_type=element_type]`): Index of the element to retrieve.

**Returns:**

The value of the element.

### `__setitem__`

`__setitem__(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], idx: IndexList[rank, element_type=element_type], val: SIMD[type, 1])`

Stores a single value into the buffer at the specified index.

**Args:**

* ​idx (`IndexList[rank, element_type=element_type]`): The index into the buffer.
* ​val (`SIMD[type, 1]`): The value to store.

`__setitem__(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], *idx: Int, *, val: SIMD[type, 1])`

Stores a single value into the buffer at the specified index.

**Args:**

* ​\*idx (`Int`): Index of the element to set.
* ​val (`SIMD[type, 1]`): The value to store.

### `origin_cast`

`origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`

Changes the origin or mutability of the buffer.

**Parameters:**

* ​mut (`Bool`): Whether the origin is mutable.
* ​origin (`Origin[mut]`): Origin of the destination buffer.

**Returns:**

A new `NDBuffer` object with the same type and address as the original `NDBuffer`, and the newly specified mutability and origin.

### `get_rank`

`get_rank(self) -> Int`

Returns the rank of the buffer.

**Returns:**

The rank of NDBuffer.

### `get_shape`

`get_shape(self) -> IndexList[rank]`

Returns the shapes of the buffer.

**Returns:**

A static tuple of size 'rank' representing shapes of the NDBuffer.
### `get_strides` `get_strides(self) -> IndexList[rank]` Returns the strides of the buffer. **Returns:** A static tuple of size 'rank' representing strides of the NDBuffer. ### `get_nd_index` `get_nd_index(self, idx: Int) -> IndexList[rank]` Computes the NDBuffer's ND-index based on the flat index. **Args:** * ​idx (`Int`): The flat index. **Returns:** The index positions. ### `__len__` `__len__(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `num_elements` `num_elements(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `size` `size(self) -> Int` Computes the NDBuffer's number of elements. **Returns:** The total number of elements in the NDBuffer. ### `__str__` `__str__(self) -> String` Gets the buffer as a string. **Returns:** A compact string of the buffer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this buffer to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `__repr__` `__repr__(self) -> String` Gets the buffer as a string. **Returns:** A compact string representation of the buffer. ### `tile` `tile[*tile_sizes: Dim](self, tile_coords: IndexList[rank, element_type=element_type]) -> NDBuffer[type, rank, origin, DimList(VariadicList(tile_sizes)), address_space=address_space]` Returns an n-d tile "slice" of the buffer of size tile\_sizes at coords. **Parameters:** * ​\*tile\_sizes (`Dim`): The size of the tiles. **Args:** * ​tile\_coords (`IndexList[rank, element_type=element_type]`): The tile index. **Returns:** The tiled buffer at tile\_coords. ### `load` `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, *idx: Int) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​\*idx (`Int`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: VariadicList[Int]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​idx (`VariadicList[Int]`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: IndexList[size, element_type=element_type]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. **Constraints:** The buffer must be contiguous or width must be 1. **Parameters:** * ​width (`Int`): The simd\_width of the load. * ​alignment (`Int`): The alignment value. **Args:** * ​idx (`IndexList[size, element_type=element_type]`): The index into the NDBuffer. **Returns:** The simd value starting at the `idx` position and ending at `idx+width`. `load[*, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self, idx: StaticTuple[Int, rank]) -> SIMD[type, width]` Loads a simd value from the buffer at the specified index. 
**Constraints:** The buffer must be contiguous or width must be 1.

**Parameters:**

* width (`Int`): The simd\_width of the load.
* alignment (`Int`): The alignment value.

**Args:**

* idx (`StaticTuple[Int, rank]`): The index into the NDBuffer.

**Returns:**

The simd value starting at the `idx` position and ending at `idx+width`.

### `store`

`store[_alignment: Int, //, *, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self: NDBuffer[type, rank, origin, shape, strides, alignment=_alignment, address_space=address_space, exclusive=exclusive], idx: IndexList[rank, element_type=element_type], val: SIMD[type, width])`

Stores a simd value into the buffer at the specified index.

**Constraints:** The buffer must be contiguous or width must be 1.

**Parameters:**

* \_alignment (`Int`): The inferred alignment of self.
* width (`Int`): The width of the simd vector.
* alignment (`Int`): The alignment value.

**Args:**

* idx (`IndexList[rank, element_type=element_type]`): The index into the buffer.
* val (`SIMD[type, width]`): The value to store.

`store[_alignment: Int, //, *, width: Int = 1, alignment: Int = _default_alignment[::Int]()](self: NDBuffer[type, rank, origin, shape, strides, alignment=_alignment, address_space=address_space, exclusive=exclusive], idx: StaticTuple[Int, rank], val: SIMD[type, width])`

Stores a simd value into the buffer at the specified index.

**Constraints:** The buffer must be contiguous or width must be 1.

**Parameters:**

* \_alignment (`Int`): The inferred alignment of self.
* width (`Int`): The width of the simd vector.
* alignment (`Int`): The alignment value.

**Args:**

* idx (`StaticTuple[Int, rank]`): The index into the buffer.
* val (`SIMD[type, width]`): The value to store.

### `dim`

`dim[index: Int](self) -> Int`

Gets the buffer dimension at the given index.

**Parameters:**

* index (`Int`): The dimension index to get.

**Returns:**

The buffer size at the given dimension.

`dim(self, index: Int) -> Int`

Gets the buffer dimension at the given index.

**Args:**

* index (`Int`): The dimension index to get.

**Returns:**

The buffer size at the given dimension.

### `stride`

`stride[index: Int](self) -> Int`

Gets the buffer stride at the given index.

**Parameters:**

* index (`Int`): The dimension index to get the stride for.

**Returns:**

The stride at the given dimension.

`stride(self, index: Int) -> Int`

Gets the buffer stride at the given index.

**Args:**

* index (`Int`): The dimension index to get the stride for.

**Returns:**

The stride at the given dimension.

### `is_contiguous`

`is_contiguous(self) -> Bool`

Checks if the buffer is contiguous in memory.

**Returns:**

True if the buffer is contiguous in memory and False otherwise.

### `flatten`

`flatten(self) -> NDBuffer[type, 1, origin, __init__[::Intable](shape.product()), address_space=address_space]`

Constructs a flattened buffer counterpart for this NDBuffer.

**Constraints:** The buffer must be contiguous.

**Returns:**

Constructed buffer object.

### `make_dims_unknown`

`make_dims_unknown(self) -> NDBuffer[type, rank, origin, address_space=address_space]`

Rebinds the NDBuffer to one with unknown shape.

**Returns:**

The rebound NDBuffer with unknown shape.

### `bytecount`

`bytecount(self) -> Int`

Returns the size of the NDBuffer in bytes.

**Returns:**

The size of the NDBuffer in bytes.

### `zero`

`zero(self)`

Sets all bytes of the NDBuffer to 0.

**Constraints:** The buffer must be contiguous.

### `tofile`

`tofile(self, path: Path)`

Writes values to a file.
**Args:** * ​path (`Path`): Path to the output file. ### `fill` `fill(self: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], val: SIMD[type, 1])` Assigns val to all elements in the buffer. The fill is performed in chunks of size N, where N is the native SIMD width of type on the system. **Args:** * ​val (`SIMD[type, 1]`): The value to store. ### `stack_allocation` `static stack_allocation[*, alignment: Int = alignof[::DType,__mlir_type.!kgen.target]()]() -> Self` Constructs an NDBuffer instance backed by stack allocated memory space. **Parameters:** * ​alignment (`Int`): Address alignment requirement for the allocation. **Returns:** Constructed NDBuffer with the allocated space. ### `prefetch` `prefetch[params: PrefetchOptions](self, *idx: Int)` Prefetches the data at the given index. **Parameters:** * ​params (`PrefetchOptions`): The prefetch configuration. **Args:** * ​\*idx (`Int`): The N-D index of the prefetched location. `prefetch[params: PrefetchOptions](self, indices: IndexList[rank])` Prefetches the data at the given index. **Parameters:** * ​params (`PrefetchOptions`): The prefetch configuration. **Args:** * ​indices (`IndexList[rank]`): The N-D index of the prefetched location. --- ## ndbuffer_reshape `ndbuffer_reshape[rank: Int, output_rank: Int, type: DType, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], new_shape: IndexList[output_rank]) -> NDBuffer[type, output_rank, origin]` --- ## NDBufferMHAOperand `@register_passable(trivial)` `struct NDBufferMHAOperand[type_: DType, rank: Int, shape: DimList, stride: DimList]` An implementation for NDBuffer arguments to MHA kernels. ## Fields * ​buffer (`NDBuffer[type_, rank, MutableAnyOrigin, shape, stride]`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(buffer: NDBuffer[type_, rank, MutableAnyOrigin, shape, stride]) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[type_, 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## neg_inf `neg_inf[dtype: DType]() -> SIMD[dtype, 1]` Gets a -inf value for the given dtype. **Constraints:** Can only be used for FP dtypes. **Parameters:** * ​dtype (`DType`): The value dtype. **Returns:** The -inf value of the given dtype. --- ## neon_intrinsics --- ## next_power_of_two `next_power_of_two(val: Int) -> Int` Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1. Notes: This operation is called `bit_ceil()` in C++. **Args:** * ​val (`Int`): The input value. **Returns:** The smallest power of 2 that is greater than or equal to the input value. `next_power_of_two(val: UInt) -> UInt` Computes the smallest power of 2 that is greater than or equal to the input value. Any integral value less than or equal to 1 will be ceiled to 1. Notes: This operation is called `bit_ceil()` in C++. **Args:** * ​val (`UInt`): The input value. **Returns:** The smallest power of 2 that is greater than or equal to the input value. 
`next_power_of_two[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the smallest power of 2 that is greater than or equal to the input value for each element of a SIMD vector. Any integral value less than or equal to 1 will be ceiled to 1.

This operation is called `bit_ceil()` in C++.

**Constraints:** The element type of the input vector must be integral.

**Parameters:**

* dtype (`DType`): `dtype` used for the computation.
* width (`Int`): SIMD width used for the computation.

**Args:**

* val (`SIMD[dtype, width]`): The input value.

**Returns:**

A SIMD value where the element at position `i` is the smallest power of 2 that is greater than or equal to the integer at position `i` of the input value.

---

## nextafter

`nextafter[dtype: DType, simd_width: Int](arg0: SIMD[dtype, simd_width], arg1: SIMD[dtype, simd_width]) -> SIMD[dtype, simd_width]`

Computes the next representable value of `arg0` in the direction of `arg1`.

**Constraints:** The element dtype of the input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* simd\_width (`Int`): The width of the input and output SIMD vector.

**Args:**

* arg0 (`SIMD[dtype, simd_width]`): The first input argument.
* arg1 (`SIMD[dtype, simd_width]`): The second input argument.

**Returns:**

The `nextafter` of the inputs.

---

## nms

## Structs

* [`BoundingBox`](./BoundingBox):

## Functions

* [`non_max_suppression`](./non_max_suppression): Buffer semantic overload.
* [`non_max_suppression_shape_func`](./non_max_suppression_shape_func): Overload to compute the output shape. Can be removed once the graph compiler supports value semantic kernels that allocate their own output.

---

## nn

APIs to build neural network components for deep learning models with Python.

## Modules

* [`conv`](/max/api/python/nn/conv)
* [`embedding`](/max/api/python/nn/embedding)
* [`kernels`](/max/api/python/nn/kernels)
* [`layer`](/max/api/python/nn/layer)
* [`linear`](/max/api/python/nn/linear)
* [`rotary_embedding`](/max/api/python/nn/rotary_embedding)
* [`sequential`](/max/api/python/nn/sequential)

## Packages

* [`attention`](/max/api/python/nn/attention)
* [`norm`](/max/api/python/nn/norm)
* [`transformer`](/max/api/python/nn/transformer)
* [`kv_cache`](/max/api/python/nn/kv_cache)

---

## nn

Provides neural network operators for deep learning models.

## Modules

* [`activations`](./activations/): The module contains implementations of activation functions.
* [`arange`](./arange/):
* [`arg_nonzero`](./arg_nonzero/):
* [`argmaxmin`](./argmaxmin/):
* [`argmaxmin_gpu`](./argmaxmin_gpu/):
* [`argsort`](./argsort/):
* [`broadcast`](./broadcast/):
* [`concat`](./concat/):
* [`conv`](./conv/):
* [`conv_transpose`](./conv_transpose/):
* [`conv_utils`](./conv_utils/):
* [`cumsum`](./cumsum/):
* [`flash_attention`](./flash_attention/):
* [`fold`](./fold/): Implements the fold operation.
* [`fused_qk_rope`](./fused_qk_rope/):
* [`gather_scatter`](./gather_scatter/):
* [`image`](./image/):
* [`index_tensor`](./index_tensor/):
* [`irfft`](./irfft/): Inverse real FFT kernel using cuFFT.
* [​`kv_cache`](./kv_cache/): * [​`kv_cache_ragged`](./kv_cache_ragged/): * [​`mha`](./mha/): * [​`mha_cross`](./mha_cross/): * [​`mha_mask`](./mha_mask/): * [​`mha_operand`](./mha_operand/): * [​`mha_score_mod`](./mha_score_mod/): * [​`mha_sm90`](./mha_sm90/): * [​`mha_tile_scheduler`](./mha_tile_scheduler/): * [​`mha_utils`](./mha_utils/): * [​`mla`](./mla/): * [​`moe`](./moe/): * [​`nms`](./nms/): * [​`normalization`](./normalization/): * [​`pad`](./pad/): * [​`pad_gpu`](./pad_gpu/): * [​`pool`](./pool/): * [​`rand_uniform`](./rand_uniform/): * [​`randn`](./randn/): * [​`repeat_interleave`](./repeat_interleave/): * [​`reshape`](./reshape/): * [​`resize`](./resize/): * [​`roi_align`](./roi_align/): * [​`sampling`](./sampling/): * [​`shapes`](./shapes/): * [​`slice`](./slice/): * [​`softmax`](./softmax/): * [​`split`](./split/): * [​`tile`](./tile/): * [​`topk`](./topk/): * [​`toppminp`](./toppminp/): * [​`toppminp_gpu`](./toppminp_gpu/): --- ## Node `struct Node[ElementType: Copyable & Movable]` A node in a linked list data structure. ## Parameters * ​ElementType (`Copyable & Movable`): The type of element stored in the node. ## Fields * ​value (`ElementType`): The value stored in this node. * ​prev (`UnsafePointer[Node[ElementType]]`): The previous node in the list. * ​next (`UnsafePointer[Node[ElementType]]`): The next node in the list. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, owned value: ElementType, prev: Optional[UnsafePointer[Node[ElementType]]], next: Optional[UnsafePointer[Node[ElementType]]])` Initialize a new Node with the given value and optional prev/next pointers. **Args:** * ​value (`ElementType`): The value to store in this node. * ​prev (`Optional[UnsafePointer[Node[ElementType]]]`): Optional pointer to the previous node. * ​next (`Optional[UnsafePointer[Node[ElementType]]]`): Optional pointer to the next node. ### `__str__` `__str__[ElementType: Copyable & Movable & Writable](self: Node[ElementType]) -> String` Convert this node's value to a string representation. **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function if `ElementType` is `Writable`. **Returns:** String representation of the node's value. ### `write_to` `write_to[ElementType: Copyable & Movable & Writable, W: Writer](self: Node[ElementType], mut writer: W)` Write this node's value to the given writer. **Parameters:** * ​ElementType (`Copyable & Movable & Writable`): Used to conditionally enable this function if `ElementType` is `Writable`. * ​W (`Writer`): The type of writer to write the value to. **Args:** * ​writer (`W`): The writer to write the value to. --- ## non_max_suppression `non_max_suppression[type: DType](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[int64, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1])` Buffer semantic overload. 
`non_max_suppression[: origin.set, //, type: DType, func: fn(SIMD[int64, 1], SIMD[int64, 1], SIMD[int64, 1]) capturing -> None](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1])`

Implements the NonMaxSuppression operator from the ONNX spec.

---

## non_max_suppression_shape_func

`non_max_suppression_shape_func[type: DType](boxes: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scores: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], max_output_boxes_per_class: Int, iou_threshold: SIMD[float32, 1], score_threshold: SIMD[float32, 1]) -> IndexList[2]`

Overload to compute the output shape. Can be removed once the graph compiler supports value semantic kernels that allocate their own output.

---

## none

Defines the builtin `NoneType`.

These are Mojo built-ins, so you don't need to import them.

## Structs

* [`NoneType`](/mojo/stdlib/builtin/none/NoneType): Represents the absence of a value.

---

## none_true

`none_true(src: NDBuffer[type, 1, origin]) -> Bool`

Returns True if none of the elements in a buffer are True and False otherwise.

**Args:**

* src (`NDBuffer[type, 1, origin]`): The buffer.

**Returns:**

True if none of the elements of the buffer are True and False otherwise.

---

## NoneType

`@register_passable(trivial)`

`struct NoneType`

Represents the absence of a value.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__() -> Self`

Construct an instance of the `None` type.

`@implicit`
`__init__(value: None) -> Self`

Construct an instance of the `None` type.

**Args:**

* value (`None`): The MLIR none type to construct from.

### `copy`

`copy(self) -> Self`

Explicit copy constructor.

**Returns:**

A copy of the value.

### `__str__`

`__str__(self) -> String`

Returns the string representation of `None`.

**Returns:**

`"None"`.

### `__repr__`

`__repr__(self) -> String`

Returns the string representation of `None`.

**Returns:**

`"None"`.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Write `None` to a writer stream.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.

**Args:**

* writer (`W`): The object to write to.
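A minimal usage sketch of the `NoneType` API above (the expected output in the comments follows from the `__str__()` and `write_to()` documentation):

```mojo
fn main():
    var a = NoneType()      # default constructor
    var b = NoneType(None)  # @implicit constructor converts from `None`
    print(a)                # NoneType is Writable, so this prints "None"
    print(String(b))        # the Stringable/__str__() path also yields "None"
```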
---

## NoPartition

`@register_passable(trivial)`

`struct NoPartition[dtype: DType]`

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAPartitionScheme`, `Movable`, `UnknownDestructibility`

## Aliases

### `accum_dtype`

`alias accum_dtype = dtype`

### `do_partition`

`alias do_partition = False`

## Methods

### `__init__`

`__init__() -> Self`

### `num_partitions`

`num_partitions(self) -> SIMD[uint32, 1]`

### `get_exp_sum_qk_max_pointer`

`get_exp_sum_qk_max_pointer(self) -> UnsafePointer[SIMD[dtype, 1]]`

---

## norm

## Modules

* [`group_norm`](/max/api/python/nn/norm/group_norm)
* [`layer_norm`](/max/api/python/nn/norm/layer_norm)
* [`rms_norm`](/max/api/python/nn/norm/rms_norm)

---

## normalization

## Functions

* [`block_reduce`](./block_reduce):
* [`layer_norm`](./layer_norm):
* [`layer_norm_cpu`](./layer_norm_cpu): Computes layernorm(elementwise\_fn(x)) across the last dimension of x, where layernorm is defined as $(x - mean(x)) / sqrt(var(x) + eps) * gamma\_fn + beta$.
* [`layer_norm_gpu`](./layer_norm_gpu):
* [`layer_norm_gpu_block`](./layer_norm_gpu_block):
* [`layer_norm_gpu_warp_tiling`](./layer_norm_gpu_warp_tiling):
* [`layer_norm_reshape`](./layer_norm_reshape):
* [`layer_norm_shape`](./layer_norm_shape): Compute the output shape of a `layer_norm` operation.
* [`rms_norm`](./rms_norm):
* [`rms_norm_cpu`](./rms_norm_cpu):
* [`rms_norm_gpu`](./rms_norm_gpu):
* [`rms_norm_gpu_block`](./rms_norm_gpu_block):
* [`rms_norm_gpu_warp_tiling`](./rms_norm_gpu_warp_tiling):
* [`rms_norm_shape`](./rms_norm_shape):
* [`welford_block_all_reduce`](./welford_block_all_reduce):
* [`welford_combine`](./welford_combine):
* [`welford_update`](./welford_update):
* [`welford_warp_all_reduce`](./welford_warp_all_reduce):
* [`welford_warp_reduce`](./welford_warp_reduce):

---

## normalize

`normalize(value: SIMD[bfloat16, 1]) -> SIMD[uint16, 1]`

`normalize(value: SIMD[int32, 1]) -> SIMD[uint32, 1]`

`normalize(value: SIMD[uint16, 1]) -> SIMD[uint16, 1]`

`normalize(value: SIMD[float32, 1]) -> SIMD[uint32, 1]`

`normalize(value: SIMD[dtype, 1]) -> SIMD[_uint_type_of_width[::Int](), 1]`

Normalize the value to the appropriate unsigned integer type. This is needed for radix sort to work correctly.

---

## normalize_neg_index

`normalize_neg_index(idx: Int, dim_size: Int) -> Int`

Indices passed to gather and scatter ops may be negative. This performs a normalization so that they can be used to index into a buffer. Returns val + dim if val < 0, otherwise val.

`normalize_neg_index[type: DType, width: Int, out_type: DType = index](idx: SIMD[type, width], dim_size: Int) -> SIMD[out_type, width]`

Indices passed to gather and scatter ops may be negative. This performs a normalization so that they can be used to index into a buffer. Returns val + dim if val < 0, otherwise val.

---

## normalize_u32

`normalize_u32(value: SIMD[uint32, 1]) -> SIMD[uint32, 1]`

---

## NullMask

`@register_passable(trivial)`

`struct NullMask`

Mask that's effectively a noop.
## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAMask`, `Movable`, `UnknownDestructibility`

## Aliases

### `apply_log2e_after_mask`

`alias apply_log2e_after_mask = False`

### `mask_out_of_bound`

`alias mask_out_of_bound = True`

### `mask_safe_out_of_bounds`

`alias mask_safe_out_of_bounds = True`

## Methods

### `mask`

`mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]`

### `status`

`status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus`

---

## num_logical_cores

`num_logical_cores() -> Int`

Returns the number of hardware threads, including hyperthreads across all CPU sockets.

**Returns:**

Int: The number of threads on the system.

---

## num_matrix_reg

`num_matrix_reg[dim_1: Int, dim_2: Int]() -> Int`

Calculates the number of matrix registers required per thread.

Determines how many registers each thread in a warp needs to store a matrix of the given dimensions. This is calculated by dividing the total number of elements (dim\_1 \* dim\_2) by the warp size, as the matrix is distributed across all threads in the warp.

**Parameters:**

* dim\_1 (`Int`): First dimension of the matrix.
* dim\_2 (`Int`): Second dimension of the matrix.

**Returns:**

The number of matrix registers needed per thread.

---

## num_performance_cores

`num_performance_cores() -> Int`

Returns the number of physical performance cores across all CPU sockets. If not known, returns the total number of physical cores.

**Returns:**

Int: The number of physical performance cores on the system.

---

## num_physical_cores

`num_physical_cores() -> Int`

Returns the number of physical cores across all CPU sockets.

**Returns:**

Int: The number of physical cores on the system.

---

## numerics

Defines utilities to work with numeric types.

You can import these APIs from the `utils` package. For example:

```mojo
from utils.numerics import FPUtils
```

## Structs

* [`FlushDenormals`](/mojo/stdlib/utils/numerics/FlushDenormals): Denormals are flushed to zero within this context, and the state is restored to the prior value on exit.
* [`FPUtils`](/mojo/stdlib/utils/numerics/FPUtils): Collection of utility functions for working with FP values.

## Functions

* [`get_accum_type`](/mojo/stdlib/utils/numerics/get_accum_type): Returns the recommended dtype for accumulation operations.
* [`inf`](/mojo/stdlib/utils/numerics/inf): Gets a +inf value for the given dtype.
* [`isfinite`](/mojo/stdlib/utils/numerics/isfinite): Checks if the value is not infinite.
* [`isinf`](/mojo/stdlib/utils/numerics/isinf): Checks if the value is infinite.
* [`isnan`](/mojo/stdlib/utils/numerics/isnan): Checks if the value is Not a Number (NaN).
* [`max_finite`](/mojo/stdlib/utils/numerics/max_finite): Returns the maximum finite value of type.
* [`max_or_inf`](/mojo/stdlib/utils/numerics/max_or_inf): Returns the maximum (potentially infinite) value of type.
* [`min_finite`](/mojo/stdlib/utils/numerics/min_finite): Returns the minimum (lowest) finite value of type.
* [`min_or_neg_inf`](/mojo/stdlib/utils/numerics/min_or_neg_inf): Returns the minimum (potentially negative infinite) value of type.
* [`nan`](/mojo/stdlib/utils/numerics/nan): Gets a NaN value for the given dtype.
* [`neg_inf`](/mojo/stdlib/utils/numerics/neg_inf): Gets a -inf value for the given dtype.
* [​`nextafter`](/mojo/stdlib/utils/numerics/nextafter): Computes next representable value of `arg0` in the direction of `arg1`. --- ## nvml Implements wrappers around the NVIDIA Management Library (nvml). ## Modules * [​`nvml`](./nvml/): Implements wrappers around the NVIDIA Management Library (nvml). --- ## nvml Implements wrappers around the NVIDIA Management Library (nvml). ## Aliases ### `CUDA_NVML_LIBRARY` `alias CUDA_NVML_LIBRARY = _Global[__init__[__mlir_type.!kgen.string]("CUDA_NVML_LIBRARY"), _OwnedDLHandle, _init_dylib]` ### `CUDA_NVML_LIBRARY_BASE_NAME` `alias CUDA_NVML_LIBRARY_BASE_NAME = "libnvidia-ml"` ### `CUDA_NVML_LIBRARY_DIR` `alias CUDA_NVML_LIBRARY_DIR = __init__[__mlir_type.!kgen.string]("/usr/lib/x86_64-linux-gnu")` ### `CUDA_NVML_LIBRARY_EXT` `alias CUDA_NVML_LIBRARY_EXT = ".so"` ## Structs * [​`ClockType`](./ClockType): * [​`Device`](./Device): * [​`DriverVersion`](./DriverVersion): * [​`EnableState`](./EnableState): * [​`Result`](./Result): --- ## Occupancy In GPU programming, occupancy is a measure of the efficiency of the GPU's compute resources. It is defined as the ratio of the number of active [warps](warp.mdx) to the maximum number of warps that can be active on a given [streaming multiprocessor](streaming-multiprocessor.mdx) (SM) at any one time. Higher occupancy can improve parallel execution and hide memory latency, but increasing occupancy does not always boost performance, as factors like memory bandwidth and instruction dependencies may create bottlenecks. The optimal occupancy level depends on the workload and GPU architecture. --- ## oct `oct(value: SIMD[dtype, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given integer. The octal representation is a base-8 encoding of the integer value. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Args:** * ​value (`SIMD[dtype, 1]`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given integer. `oct[T: Intable, //](value: T, /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given integer. The octal representation is a base-8 encoding of the integer value. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Parameters:** * ​T (`Intable`): The intable type to represent in octal. **Args:** * ​value (`T`): The integer value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given integer. `oct(value: SIMD[bool, 1], /, *, prefix: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("0o")) -> String` Returns the octal string representation of the given scalar bool. The octal representation is a base-8 encoding of the bool. The returned string will be prefixed with "0o" to indicate that the subsequent digits are octal. **Args:** * ​value (`SIMD[bool, 1]`): The bool value to format. * ​prefix (`StringSlice[StaticConstantOrigin]`): The prefix of the formatted int. **Returns:** A string containing the octal representation of the given bool. 
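For example, a few calls to the `oct()` overloads above with their expected results in comments (`oct()` is a builtin, so no import is needed):

```mojo
fn main():
    print(oct(Int8(8)))  # 0o10
    print(oct(255))      # 0o377, via the Intable overload
    print(oct(True))     # 0o1, via the scalar bool overload
```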
---

## Offline inference

Offline inference with MAX allows you to run large language models directly in Python without relying on external API endpoints. This is in contrast to online inference, where you would send requests to a remote service.

## When to use offline inference

Use offline inference when you want to perform model inference without running a separate model inference server, typically when you need to process a batch of inputs concurrently. This approach is beneficial for tasks that require high throughput and can be executed in a controlled environment, such as data preprocessing, model evaluation, or when working with large datasets that need to be processed in batches.

## How offline inference works

The core of offline inference revolves around the [`LLM`](/max/api/python/entrypoints#max.entrypoints.llm.LLM) class, which provides a Python interface to load and run language models. Specify the model from a Hugging Face repository or a local path, and MAX handles the process of downloading the model. The [`PipelineConfig`](/max/api/python/pipelines/config/#max.pipelines.lib.config.PipelineConfig) class allows you to specify parameters related to the inference pipeline, such as [`max_length`](/max/api/python/pipelines/config/#max.pipelines.lib.config.PipelineConfig.max_length) and [`max_num_steps`](/max/api/python/pipelines/config/#max.pipelines.lib.config.PipelineConfig.max_num_steps). The [`generate()`](/max/api/python/entrypoints#max.entrypoints.llm.LLM.generate) function is used to generate text from the model.

:::note The Python API for offline inference currently supports text-only input and does not support multi-modal models. If you need to work with vision capabilities, see the tutorial on [Generate image descriptions with Llama 3.2 Vision](/max/tutorials/deploy-llama-vision). :::

## Quickstart

This quickstart demonstrates how to run offline inference with a Hugging Face model using MAX in Python.

1. Set up your project:

2. Create a file named `main.py` with the following code:

   ```python
   from max.entrypoints.llm import LLM
   from max.pipelines import PipelineConfig


   def main():
       model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
       pipeline_config = PipelineConfig(model_path=model_path)
       llm = LLM(pipeline_config)

       prompts = [
           "In the beginning, there was",
           "I believe the meaning of life is",
           "The fastest way to learn python is",
       ]

       print("Generating responses...")
       responses = llm.generate(prompts, max_new_tokens=50)
       for i, (prompt, response) in enumerate(zip(prompts, responses)):
           print(f"========== Response {i} ==========")
           print(prompt + response)
           print()


   if __name__ == "__main__":
       main()
   ```

   This script downloads the [`modularai/Llama-3.1-8B-Instruct-GGUF`](https://huggingface.co/modularai/Llama-3.1-8B-Instruct-GGUF) model (if not already downloaded) and then runs inference locally. While the initial model download requires internet access, the actual inference process is self-contained and does not send requests to a remote service for generating text.

   You can update the script to use a different model or modify the prompts to generate different responses. For a list of available models, see our [Model repository](https://builds.modular.com/?category=models).
   We chose the Llama-3.1-8B-Instruct-GGUF model for this example because it's not gated, meaning it's freely available without requiring special access permissions or authentication.

   For offline inference, MAX supports models in GGUF format. This includes most generative LLMs with "Chat" modality, but the specific configuration parameters might vary between models. Always refer to the model's documentation for compatibility details and optimal configuration settings.

3. Run the script:

   ```sh
   python main.py
   ```

   This command will download the model and generate responses for the prompts. You should see output like the following:

   ```output
   Generating responses...
   ========== Response 0 ==========
   In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's

   ========== Response 1 ==========
   I believe the meaning of life is to find your gift. The purpose of life is to give it away to others. I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others. I believe that the meaning of life is

   ========== Response 2 ==========
   The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:

   1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.
   ```

## Next steps

For more information on offline inference, see the following:

- [Offline inference example](https://github.com/modular/modular/blob/main/examples/offline-inference/basic.py)
- [Offline inference recipe](https://builds.modular.com/recipes/max-offline-inference)

---

## open

`open[PathLike: PathLike](path: PathLike, mode: StringSlice[origin]) -> FileHandle`

Opens the file specified by path using the mode provided, returning a FileHandle.

**Parameters:**

* PathLike (`PathLike`): A type conforming to the `os.PathLike` trait.

**Args:**

* path (`PathLike`): The path to the file to open.
* mode (`StringSlice[origin]`): The mode to open the file in (the mode can be "r" or "w").

**Returns:**

A file handle.

---

## Operators, expressions, and dunder methods

Mojo includes a variety of operators for manipulating values of different types. Generally, the operators are equivalent to those found in Python, though many operators also work with additional Mojo types such as `SIMD` vectors. Additionally, Mojo allows you to define the behavior of most of these operators for your own custom types by implementing special *dunder* (double underscore) methods.

This document contains the following three sections:

- [Operators and expressions](#operators-and-expressions) discusses Mojo's built-in operators and how they work with commonly used Mojo types.
- [Implement operators for custom types](#implement-operators-for-custom-types) describes the dunder methods that you can implement to support using operators with custom structs that you create.
- [An example of implementing operators for a custom type](#an-example-of-implementing-operators-for-a-custom-type) shows a progressive example of writing a custom struct with support for several operators.
## Operators and expressions

This section lists the operators that Mojo supports, their order of precedence and associativity, and describes how these operators behave with several commonly used built-in types.

### Operator precedence and associativity

The table below lists the various Mojo operators, along with their order of precedence and associativity (also referred to as grouping). This table lists operators from the highest precedence to the lowest precedence.

| **Operators** | **Description** | **Associativity (Grouping)** |
| ------------- | --------------- | ----------------- |
| `()` | Parenthesized expression | Left to right |
| `x[index]`, `x[index:index]` | Subscripting, slicing | Left to right |
| `**` | Exponentiation | Right to left |
| `+x`, `-x`, `~x` | Positive, negative, bitwise NOT | Right to left |
| `*`, `@`, `/`, `//`, `%` | Multiplication, matrix, division, floor division, remainder | Left to right |
| `+`, `-` | Addition and subtraction | Left to right |
| `<<`, `>>` | Shifts | Left to right |
| `&` | Bitwise AND | Left to right |
| `^` | Bitwise XOR | Left to right |
| `\|` | Bitwise OR | Left to right |
| `in`, `not in`, `is`, `is not`, `<`, `<=`, `>`, `>=`, `!=`, `==` | Comparisons, membership tests, identity tests | Left to right |
| `not x` | Boolean NOT | Right to left |
| `x and y` | Boolean AND | Left to right |
| `x or y` | Boolean OR | Left to right |
| `if-else` | Conditional expression | Right to left |
| `:=` | Assignment expression (walrus operator) | Right to left |

Mojo supports the same operators as Python (plus a few extensions), and they have the same precedence levels. For example, the following arithmetic expression evaluates to 40:

```mojo
5 + 4 * 3 ** 2 - 1
```

It is equivalent to the following parenthesized expression to explicitly control the order of evaluation:

```mojo
(5 + (4 * (3 ** 2))) - 1
```

Associativity defines how operators of the same precedence level are grouped into expressions. The table indicates whether operators of a given level are left- or right-associative. For example, multiplication and division are left associative, so the following expression results in a value of 3:

```mojo
3 * 4 / 2 / 2
```

It is equivalent to the following parenthesized expression to explicitly control the order of evaluation:

```mojo
((3 * 4) / 2) / 2
```

In contrast, exponentiation is right associative, so the following expression results in a value of 262,144:

```mojo
4 ** 3 ** 2
```

It is equivalent to the following parenthesized expression to explicitly control the order of evaluation:

```mojo
4 ** (3 ** 2)
```

:::note Mojo also uses the caret (`^`) as the [*transfer sigil*](/mojo/manual/values/ownership#transfer-arguments-owned-and-). In expressions where its use might be ambiguous, Mojo treats the character as the bitwise XOR operator. For example, `x^+1` is treated as `(x)^(+1)`. :::

### Arithmetic and bitwise operators

[Numeric types](/mojo/manual/types#numeric-types) describes the different numeric types provided by the Mojo standard library. The arithmetic and bitwise operators have slightly different behavior depending on the types of values provided.

#### `Int` and `UInt` values

The [`Int`](/mojo/stdlib/builtin/int/Int) and [`UInt`](/mojo/stdlib/builtin/uint/UInt) types represent signed and unsigned integers of the [word size](https://en.wikipedia.org/wiki/Word_(computer_architecture)) of the CPU, typically 64 bits or 32 bits.
The `Int` and `UInt` types support all arithmetic operators except matrix multiplication (`@`), as well as all bitwise and shift operators. If both operands to a binary operator are `Int` values the result is an `Int`, if both operands are `UInt` values the result is a `UInt`, and if one operand is `Int` and the other `UInt` the result is an `Int`. The one exception for these types is true division, `/`, which always returns a `Float64` type value.

```mojo
var a_int: Int = -7
var b_int: Int = 4
sum_int = a_int + b_int  # Result is type Int
print("Int sum:", sum_int)

var i_uint: UInt = 9
var j_uint: UInt = 8
sum_uint = i_uint + j_uint  # Result is type UInt
print("UInt sum:", sum_uint)

sum_mixed = a_int + i_uint  # Result is type Int
print("Mixed sum:", sum_mixed)

quotient_int = a_int / b_int  # Result is type Float64
print("Int quotient:", quotient_int)
quotient_uint = i_uint / j_uint  # Result is type Float64
print("UInt quotient:", quotient_uint)
```

```output
Int sum: -3
UInt sum: 17
Mixed sum: 2
Int quotient: -1.75
UInt quotient: 1.125
```

#### `SIMD` values

The Mojo standard library defines the [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type to represent a fixed-size array of values that can fit into a processor's register. This allows you to take advantage of [single instruction, multiple data](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) operations in hardware to efficiently process multiple values in parallel.

`SIMD` values of a numeric [`DType`](/mojo/stdlib/builtin/dtype/DType) support all arithmetic operators except for matrix multiplication (`@`), though the left shift (`<<`) and right shift (`>>`) operators support only integral types. Additionally, `SIMD` values of an integral or boolean type support all bitwise operators. `SIMD` values apply the operators in an *elementwise* fashion, as shown in the following example:

```mojo
simd1 = SIMD[DType.int32, 4](2, 3, 4, 5)
simd2 = SIMD[DType.int32, 4](-1, 2, -3, 4)
simd3 = simd1 * simd2
print(simd3)
```

```output
[-2, 6, -12, 20]
```

[`Scalar`](/mojo/stdlib/builtin/simd/) values are simply aliases for single-element `SIMD` vectors, so `Float16` is just an alias for `SIMD[DType.float16, 1]`. Therefore `Scalar` values support the same set of arithmetic and bitwise operators.

```mojo
var f1: Float16 = 2.5
var f2: Float16 = -4.0
var f3 = f1 * f2  # Implicitly of type Float16
print(f3)
```

```output
-10.0
```

When using these operators on `SIMD` values, Mojo requires both to have the same size and `DType`, and the result is a `SIMD` of the same size and `DType`. The operators do *not* automatically widen lower precision `SIMD` values to higher precision. This means that the `DType` of each value must be the same or else the result is a compilation error.

```mojo
var i8: Int8 = 8
var f64: Float64 = 64.0
result = i8 * f64
```

```output
error: invalid call to '__mul__': could not deduce parameter 'type' of parent struct 'SIMD'
    result = i8 * f64
             ~~~^~~~~
```

If you need to perform an arithmetic or bitwise operator on two `SIMD` values of different types, you can explicitly convert a value to the desired type either by invoking its [`cast()`](/mojo/stdlib/builtin/simd/SIMD#cast) method or by passing it as an argument to the constructor of the target type.
```mojo simd1 = SIMD[DType.float32, 4](2.2, 3.3, 4.4, 5.5) simd2 = SIMD[DType.int16, 4](-1, 2, -3, 4) simd3 = simd1 * simd2.cast[DType.float32]() # Convert with cast() method print("simd3:", simd3) simd4 = simd2 + SIMD[DType.int16, 4](simd1) # Convert with SIMD constructor print("simd4:", simd4) ``` ```output simd3: [-2.2, 6.6, -13.200001, 22.0] simd4: [1, 5, 1, 9] ``` One exception is that the exponentiation operator, `**`, is overloaded so that you can specify an `Int` type exponent. All values in the `SIMD` are exponentiated to the same power. ```mojo base_simd = SIMD[DType.float64, 4](1.1, 2.2, 3.3, 4.4) var power: Int = 2 pow_simd = base_simd ** power # Result is SIMD[DType.float64, 4] print(pow_simd) ``` ```output [1.2100000000000002, 4.8400000000000007, 10.889999999999999, 19.360000000000003] ``` There are three operators related to division: - `/`, the "true division" operator, performs floating point division for `SIMD` values with a floating point `DType`. For `SIMD` values with an integral `DType`, true division *truncates* the quotient to an integral result. ```mojo num_float16 = SIMD[DType.float16, 4](3.5, -3.5, 3.5, -3.5) denom_float16 = SIMD[DType.float16, 4](2.5, 2.5, -2.5, -2.5) num_int32 = SIMD[DType.int32, 4](5, -6, 7, -8) denom_int32 = SIMD[DType.int32, 4](2, 3, -4, -5) # Result is SIMD[DType.float16, 4] true_quotient_float16 = num_float16 / denom_float16 print("True float16 division:", true_quotient_float16) # Result is SIMD[DType.int32, 4] true_quotient_int32 = num_int32 / denom_int32 print("True int32 division:", true_quotient_int32) ``` ```output True float16 division: [1.4003906, -1.4003906, -1.4003906, 1.4003906] True int32 division: [2, -2, -1, 1] ``` - `//`, the "floor division" operator, performs division and *rounds down* the result to the nearest integer. The resulting `SIMD` is still the same type as the original operands. For example: ```mojo # Result is SIMD[DType.float16, 4] var floor_quotient_float16 = num_float16 // denom_float16 print("Floor float16 division:", floor_quotient_float16) # Result is SIMD[DType.int32, 4] var floor_quotient_int32 = num_int32 // denom_int32 print("Floor int32 division:", floor_quotient_int32) ``` ```output Floor float16 division: [1.0, -2.0, -2.0, 1.0] Floor int32 division: [2, -2, -2, 1] ``` - `%`, the modulo operator, returns the remainder after dividing the numerator by the denominator an integral number of times. The relationship between the `//` and `%` operators can be defined as `num == denom * (num // denom) + (num % denom)`. For example: ```mojo # Result is SIMD[DType.float16, 4] var remainder_float16 = num_float16 % denom_float16 print("Modulo float16:", remainder_float16) # Result is SIMD[DType.int32, 4] var remainder_int32 = num_int32 % denom_int32 print("Modulo int32:", remainder_int32) print() # Result is SIMD[DType.float16, 4] var result_float16 = denom_float16 * floor_quotient_float16 + remainder_float16 print("Result float16:", result_float16) # Result is SIMD[DType.int32, 4] var result_int32 = denom_int32 * floor_quotient_int32 + remainder_int32 print("Result int32:", result_int32) ``` ```output Modulo float16: [1.0, 1.5, -1.5, -1.0] Modulo int32: [1, 0, -1, -3] Result float16: [3.5, -3.5, 3.5, -3.5] Result int32: [5, -6, 7, -8] ``` #### `IntLiteral` and `FloatLiteral` values [`IntLiteral`](/mojo/stdlib/builtin/int_literal/IntLiteral) and [`FloatLiteral`](/mojo/stdlib/builtin/float_literal/FloatLiteral) are compile-time, numeric values. 
When they are used in a compile-time context, they are arbitrary-precision values. When they are used in a run-time context, they are materialized as `Int` and `Float64` type values, respectively.

As an example, the following code causes a compile-time error because the calculated `IntLiteral` value is too large to store in an `Int` variable:

```mojo
alias big_int = (1 << 65)  # Compile-time IntLiteral value
var run_time_int: Int = big_int  # Error: value is too big to materialize as an Int
```

### Comparison operators

Mojo provides the standard set of comparison operators: `==`, `!=`, `<`, `<=`, `>`, and `>=`. However their behavior depends on the type of values being compared.

- `Int`, `UInt`, `IntLiteral`, and any type that can be implicitly converted to `Int` or `UInt` do standard numerical comparison with a `Bool` result.
- Two `SIMD` values can be compared only if they are the same `DType` and size. (If you need to compare two `SIMD` values of different types, you can explicitly convert a value so that they have the same type either by invoking its [`cast()`](/mojo/stdlib/builtin/simd/SIMD#cast) method or by passing it as an argument to the constructor of the target type.) Mojo performs elementwise comparison with a `SIMD[DType.bool]` result. For example:

  ```mojo
  simd1 = SIMD[DType.int16, 4](-1, 2, -3, 4)
  simd2 = SIMD[DType.int16, 4](0, 1, 2, 3)
  simd3 = simd1 > simd2  # SIMD[DType.bool, 4]
  print(simd3)
  ```

  ```output
  [False, True, False, True]
  ```

- An integral type `SIMD` can be compared to an `IntLiteral`, `Int`, `UInt`, or any type that can be implicitly converted to `Int` or `UInt`. Mojo performs elementwise comparison against the value provided and produces a `SIMD[DType.bool]` result. For example:

  ```mojo
  simd1 = SIMD[DType.int16, 4](-1, 2, -3, 4)
  simd2 = simd1 > 2  # SIMD[DType.bool, 4]
  print(simd2)
  ```

  ```output
  [False, False, False, True]
  ```

- A floating point type `SIMD` can be compared to a `FloatLiteral`, `IntLiteral`, `Int`, `UInt`, or any type that can be implicitly converted to `Int` or `UInt`. Mojo performs elementwise comparison against the value provided and produces a `SIMD[DType.bool]` result. For example:

  ```mojo
  simd1 = SIMD[DType.float32, 4](1.1, -2.2, 3.3, -4.4)
  simd2 = simd1 > 0.5  # SIMD[DType.bool, 4]
  print(simd2)
  ```

  ```output
  [True, False, True, False]
  ```

- `Scalar` values are simply aliases for single-element `SIMD` vectors. Therefore, the same restrictions apply against comparing different types. In other words, you can't compare a `Float16` value to a `Float32` value unless you convert the values to the same type. You can convert a `Scalar` value by passing it as an argument to the constructor of the target type:

  ```mojo
  var float1: Float16 = 12.345  # SIMD[DType.float16, 1]
  var float2: Float32 = 0.5     # SIMD[DType.float32, 1]
  result = Float32(float1) > float2  # Result is SIMD[DType.bool, 1]
  print(result)
  ```

  ```output
  True
  ```

  :::note Note that the result of comparing a `Scalar` value is a `SIMD[DType.bool, 1]`, which is not the same as a `Bool` value. However, `SIMD` values of size 1 implement the `Boolable` trait, which provides for implicit conversion to a `Bool` value when used in a boolean expression. :::

- `String` and `StringLiteral` values can be compared using standard lexicographical ordering, producing a `Bool`. (For example, "Zebra" is treated as less than "ant" because upper case letters occur before lower case letters in the character encoding.) String comparisons are discussed further in the [String operators](#string-operators) section below.

Several other types in the Mojo standard library support various comparison operators, in particular the equality and inequality comparisons.
Consult the [API documentation](/mojo/lib) for a type to determine whether any comparison operators are supported. ### String operators As discussed in [Strings](/mojo/manual/types#strings), the [`String`](/mojo/stdlib/collections/string/string/String) type represents a mutable string value. In contrast, the [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral) type represents a literal string that is embedded into your compiled program. At run-time a `StringLiteral` is loaded into memory as a constant that persists for the duration of your program's execution. The `String` type has a constructor that accepts a `StringLiteral` value, which means that a `StringLiteral` can be implicitly converted to a `String` at run-time if you pass it as an argument to a function or assign it to a `String` type variable. You also can use the [`String` constructor](/mojo/stdlib/collections/string/string/String#__init__) to explicitly convert the `StringLiteral` to a `String` value at run-time. Additionally, the [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice) type is a non-copying view of a `String` or `StringLiteral` value. It provides some additional methods for string manipulation. A common use case is to create a `StringSlice` from a `StringLiteral` using the `StaticString` alias so that you can then invoke the `format()` method on it. For example: ```mojo alias message = StaticString('{} says, "{}"') name = "Pat" greeting = "Good day!" print(message.format(name, greeting)) ``` ```output Pat says, "Good day!" ``` #### String concatenation The `+` operator performs string concatenation. The `StringLiteral` type supports compile-time string concatenation. ```mojo alias last_name = "Curie" # Compile-time StringLiteral alias alias marie = "Marie " + last_name print(marie) # Compile-time concatenation assigned to a run-time StringLiteral type variable pierre = "Pierre " + last_name print(pierre) ``` ```output Marie Curie Pierre Curie ``` With the `String` type the `+` operator performs run-time string concatenation to produce a new `String` value. You can also concatenate a `String` and a `StringLiteral` to produce a new `String` result. ```mojo var first_name: String = "Grace" var last_name: String = " Hopper" # String type result programmer = first_name + last_name print(programmer) # String type result singer = first_name + " Slick" print(singer) ``` ```output Grace Hopper Grace Slick ``` :::tip When concatenating multiple values together to form a `String`, using the multi-argument `String()` constructor is more performant than using multiple `+` concatenation operators and can improve code readability. For example, instead of writing this: ```mojo result = "The point at (" + String(x) + ", " + String(y) + ")" ``` you can write: ```mojo result = String("The point at (", x, ", ", y, ")") ``` ::: #### String replication The `*` operator replicates a `String` a specified number of times. For example: ```mojo var str1: String = "la" str2 = str1 * 5 print(str2) ``` ```output lalalalala ``` `StringLiteral` supports the `*` operator for both compile-time and run-time string replication. The following examples perform compile-time string replication resulting in `StringLiteral` values: ```mojo alias divider1 = "=" * 40 alias symbol = "#" alias divider2 = symbol * 40 # You must define the following function using `fn` because an alias # initializer cannot call a function that can potentially raise an error. 
fn generate_divider(char: String, repeat: Int) -> String: return char * repeat alias divider3 = generate_divider("~", 40) # Evaluated at compile-time print(divider1) print(divider2) print(divider3) ``` ```output ======================================== ######################################## ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` In contrast, the following examples perform run-time string replication resulting in `String` values: ```mojo repeat = 40 div1 = "^" * repeat print(div1) print("_" * repeat) ``` ```output ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ________________________________________ ``` #### String comparison `String` and `StringLiteral` values can be compared using standard lexicographical ordering, producing a `Bool`. For example, "Zebra" is treated as less than "ant" because upper case letters occur before lower case letters in the character encoding. ```mojo var animal: String = "bird" is_cat_eq = "cat" == animal print(StaticString('Is "cat" equal to "{}"?').format(animal), is_cat_eq) is_cat_ne = "cat" != animal print(StaticString('Is "cat" not equal to "{}"?').format(animal), is_cat_ne) is_bird_eq = "bird" == animal print(StaticString('Is "bird" equal to "{}"?').format(animal), is_bird_eq) is_cat_gt = "CAT" > animal print(StaticString('Is "CAT" greater than "{}"?').format(animal), is_cat_gt) is_ge_cat = animal >= "CAT" print(StaticString('Is "{}" greater than or equal to "CAT"?').format(animal), is_ge_cat) ``` ```output Is "cat" equal to "bird"? False Is "cat" not equal to "bird"? True Is "bird" equal to "bird"? True Is "CAT" greater than "bird"? False Is "bird" greater than or equal to "CAT"? True ``` #### Substring testing `String`, `StringLiteral`, and `StringSlice` support using the `in` operator to produce a `Bool` result indicating whether a given substring appears within another string. The operator is overloaded so that you can use any combination of `String` and `StringLiteral` for both the substring and the string to test. ```mojo var food: String = "peanut butter" if "nut" in food: print("It contains a nut") else: print("It doesn't contain a nut") ``` ```output It contains a nut ``` #### String indexing and slicing `String`, `StringLiteral`, and `StringSlice` allow you to use indexing to return a single character. Character positions are identified with a zero-based index starting from the first character. You can also specify a negative index to count backwards from the end of the string, with the last character identified by index -1. Specifying an index beyond the bounds of the string results in a run-time error. ```mojo alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" # StringLiteral type value print(alphabet[0], alphabet[-1]) # The following would produce a run-time error # print(alphabet[45]) ``` ```output A Z ``` The `String` and `StringSlice` types—but *not* the `StringLiteral` type—also support slices to return a substring from the original `String`. Providing a slice in the form `[start:end]` returns a substring starting with the character index specified by `start` and continuing up to but not including the character at index `end`. You can use positive or negative indexing for both the start and end values. Omitting `start` is the same as specifying `0`, and omitting `end` is the same as specifying 1 plus the length of the string. 
```mojo
var alphabet: String = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(alphabet[1:4])   # The 2nd through 4th characters
print(alphabet[:6])    # The first 6 characters
print(alphabet[-6:])   # The last 6 characters
```

```output
BCD
ABCDEF
UVWXYZ
```

You can also specify a slice with a `step` value, as in `[start:end:step]`, indicating the increment between subsequent indices of the slice. (This is also sometimes referred to as a "stride.") If you provide a negative value for `step`, characters are selected in reverse order starting with `start` but then with *decreasing* index values up to but not including `end`.

```mojo
print(alphabet[1:6:2])     # The 2nd, 4th, and 6th characters
print(alphabet[-1:-4:-1])  # The last 3 characters in reverse order
print(alphabet[::-1])      # The entire string reversed
```

```output
BDF
ZYX
ZYXWVUTSRQPONMLKJIHGFEDCBA
```

### In-place assignment operators

Mutable types that support binary arithmetic, bitwise, and shift operators typically support equivalent in-place assignment operators. That means that for a type that supports the `+` operator, the following two statements are essentially equivalent:

```mojo
a = a + b
a += b
```

However there is a subtle difference between the two. In the first example, the expression `a + b` produces a new value, which is then assigned to `a`. In contrast, the second example does an in-place modification of the value currently assigned to `a`. For register-passable types, the compiled results might be equivalent at run-time. But for a memory-only type, the first example allocates storage for the result of `a + b` and then assigns the value to the variable, whereas the second example can do an in-place modification of the existing value.

:::note A type must explicitly implement in-place assignment methods, so you might encounter some types where in-place equivalents are not supported. :::

### Assignment expressions

The "walrus" operator, `:=`, allows you to assign a value to a variable within an expression. The value provided is both assigned to the variable and becomes the result of the expression. This often can simplify conditional or looping logic. For example, consider the following prompting loop:

```mojo
while True:
    name = input("Enter a name or 'quit' to exit: ")
    if name == "quit":
        break
    print("Hello,", name)
```

```output
Enter a name or 'quit' to exit: Coco
Hello, Coco
Enter a name or 'quit' to exit: Vivienne
Hello, Vivienne
Enter a name or 'quit' to exit: quit
```

Using the walrus operator, you can implement the same behavior like this:

```mojo
while (name := input("Enter a name or 'quit' to exit: ")) != "quit":
    print("Hello,", name)
```

```output
Enter a name or 'quit' to exit: Donna
Hello, Donna
Enter a name or 'quit' to exit: Vera
Hello, Vera
Enter a name or 'quit' to exit: quit
```

## Implement operators for custom types

When you create a custom struct, Mojo allows you to define the behavior of many of the built-in operators for that type by implementing special *dunder* (double underscore) methods. This section lists the dunder methods associated with the operators and briefly describes the requirements for implementing them.

:::note Currently, Mojo doesn't support defining arbitrary custom operators (for example, `-^-`). You can define behaviors for only the operators listed in the following subsections. :::

### Unary operator dunder methods

A unary operator invokes an associated dunder method on the value to which it applies. The supported unary operators and their corresponding methods are shown in the table below.
| **Operator** | **Dunder method** |
| --------------- | ----------------- |
| `+` positive | `__pos__()` |
| `-` negative | `__neg__()` |
| `~` bitwise NOT | `__invert__()` |

For each of these methods that you decide to implement, you should return either the original value if unchanged, or a new value representing the result of the operator. For example, you could implement the `-` negative operator for a `MyInt` struct like this:

```mojo
@value
struct MyInt:
    var value: Int

    def __neg__(self) -> Self:
        return Self(-self.value)
```

### Binary arithmetic, shift, and bitwise operator dunder methods

When you have a binary expression like `a + b`, there are two possible dunder methods that could be invoked. Mojo first determines whether the left-hand side value (`a` in this example) has a "normal" version of the `+` operator's dunder method defined that accepts a value of the right-hand side's type. If so, it then invokes that method on the left-hand side value and passes the right-hand side value as an argument.

If Mojo doesn't find a matching "normal" dunder method on the left-hand side value, it then checks whether the right-hand side value has a "reflected" (sometimes referred to as "reversed") version of the `+` operator's dunder method defined that accepts a value of the left-hand side's type. If so, it then invokes that method on the right-hand side value and passes the left-hand side value as an argument.

For both the normal and the reflected versions, the dunder method should return a new value representing the result of the operator.

Additionally, there are dunder methods corresponding to the in-place assignment versions of the operators. These methods receive the right-hand side value as an argument and the methods should modify the existing left-hand side value to reflect the result of the operator.

The table below lists the various binary arithmetic, shift, and bitwise operators and their corresponding normal, reflected, and in-place dunder methods.

| **Operator** | **Normal** | **Reflected** | **In-place** |
| ------------ | ---------- | ------------- | ------------ |
| `+` addition | `__add__()` | `__radd__()` | `__iadd__()` |
| `-` subtraction | `__sub__()` | `__rsub__()` | `__isub__()` |
| `*` multiplication | `__mul__()` | `__rmul__()` | `__imul__()` |
| `/` division | `__truediv__()` | `__rtruediv__()` | `__itruediv__()` |
| `//` floor division | `__floordiv__()` | `__rfloordiv__()` | `__ifloordiv__()` |
| `%` modulus/remainder | `__mod__()` | `__rmod__()` | `__imod__()` |
| `**` exponentiation | `__pow__()` | `__rpow__()` | `__ipow__()` |
| `@` matrix multiplication | `__matmul__()` | `__rmatmul__()` | `__imatmul__()` |
| `<<` left shift | `__lshift__()` | `__rlshift__()` | `__ilshift__()` |
| `>>` right shift | `__rshift__()` | `__rrshift__()` | `__irshift__()` |
| `&` bitwise AND | `__and__()` | `__rand__()` | `__iand__()` |
| `\|` bitwise OR | `__or__()` | `__ror__()` | `__ior__()` |
| `^` bitwise XOR | `__xor__()` | `__rxor__()` | `__ixor__()` |

As an example, consider implementing support for all of the `+` operator dunder methods for a custom `MyInt` struct. This shows supporting adding two `MyInt` instances as well as adding a `MyInt` and an `Int`. We can support the case of having the `Int` as the right-hand side argument by overloading the definition of `__add__()`. But to support the case of having the `Int` as the left-hand side argument, we need to implement an `__radd__()` method, because the built-in `Int` type doesn't have an `__add__()` method that supports our custom `MyInt` type.
```mojo
@value
struct MyInt:
    var value: Int

    def __add__(self, rhs: MyInt) -> Self:
        return MyInt(self.value + rhs.value)

    def __add__(self, rhs: Int) -> Self:
        return MyInt(self.value + rhs)

    def __radd__(self, lhs: Int) -> Self:
        return MyInt(self.value + lhs)

    def __iadd__(mut self, rhs: MyInt) -> None:
        self.value += rhs.value

    def __iadd__(mut self, rhs: Int) -> None:
        self.value += rhs
```

### Comparison operator dunder methods

When you have a comparison expression like `a < b`, Mojo invokes the corresponding comparison dunder method on the left-hand side value, passing the right-hand side value as an argument. Each comparison method returns a `Bool` result. A type that implements all six comparison methods can declare conformance to the `Comparable` trait, while a type that implements just the equality and inequality methods can declare conformance to the `EqualityComparable` trait. The supported comparison operators and their corresponding dunder methods are:

| **Operator** | **Dunder method** |
| --------------- | ----------------- |
| `==` equal | `__eq__()` |
| `!=` not equal | `__ne__()` |
| `<` less than | `__lt__()` |
| `<=` less than or equal | `__le__()` |
| `>` greater than | `__gt__()` |
| `>=` greater than or equal | `__ge__()` |

:::note

The `Comparable` and `EqualityComparable` traits don't allow the comparison dunder methods to raise errors. Because using `def` to define a method implies that it can raise an error, you must use `fn` to implement the comparison methods declared by these traits. See [Functions](/mojo/manual/functions) for more information on the differences between defining functions with `def` and `fn`.

:::

As an example, consider implementing support for all of the comparison operator dunder methods for a custom `MyInt` struct.

```mojo
@value
struct MyInt(
    Comparable
):
    var value: Int

    fn __eq__(self, rhs: MyInt) -> Bool:
        return self.value == rhs.value

    fn __ne__(self, rhs: MyInt) -> Bool:
        return self.value != rhs.value

    fn __lt__(self, rhs: MyInt) -> Bool:
        return self.value < rhs.value

    fn __le__(self, rhs: MyInt) -> Bool:
        return self.value <= rhs.value

    fn __gt__(self, rhs: MyInt) -> Bool:
        return self.value > rhs.value

    fn __ge__(self, rhs: MyInt) -> Bool:
        return self.value >= rhs.value
```

### Membership operator dunder methods

The `in` and `not in` operators depend on a type implementing the `__contains__()` dunder method. Typically only collection types (such as `List`, `Dict`, and `Set`) implement this method. It should accept the right-hand side value as an argument and return a `Bool` indicating whether the value is present in the collection or not (a short sketch appears below).

### Subscript and slicing dunder methods

Subscripting and slicing typically apply only to sequential collection types, like `List` and `String`. Subscripting references a single element of a collection or a dimension of a multi-dimensional container, whereas slicing refers to a range of values. A type supports both subscripting and slicing by implementing the `__getitem__()` method for retrieving values and the `__setitem__()` method for setting values.

#### Subscripting

In the simple case of a one-dimensional sequence, the `__getitem__()` and `__setitem__()` methods should have signatures similar to this:

```mojo
struct MySeq[type: Copyable & Movable]:
    fn __getitem__(self, idx: Int) -> type:
        # Return element at the given index
        ...

    fn __setitem__(mut self, idx: Int, value: type):
        # Assign the element at the given index the provided value
```

It's also possible to support multi-dimensional collections, in which case you can implement both `__getitem__()` and `__setitem__()` methods to accept multiple index arguments, or even variadic index arguments for arbitrary-dimension collections.

```mojo
struct MySeq[type: Copyable & Movable]:
    # 2-dimension support
    fn __getitem__(self, x_idx: Int, y_idx: Int) -> type:
        ...

    # Arbitrary-dimension support
    fn __getitem__(self, *indices: Int) -> type:
        ...
```

#### Slicing

You provide slicing support for a collection type also by implementing `__getitem__()` and `__setitem__()` methods. But for slicing, instead of accepting an `Int` index (or indices, in the case of a multi-dimensional collection), you implement the methods to accept a [`Slice`](/mojo/stdlib/builtin/builtin_slice/Slice) (or multiple `Slice`s in the case of a multi-dimensional collection).
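First, the promised `__contains__()` sketch for the membership operators described above. This is a minimal illustration under assumed names: the `MyList` struct and its `elements` field are hypothetical, following the `MySeq` style used in this section.

```mojo
struct MyList[type: Copyable & Movable & EqualityComparable]:
    var elements: List[type]

    fn __contains__(self, value: type) -> Bool:
        # Linear scan: report whether any element compares equal to `value`.
        for i in range(len(self.elements)):
            if self.elements[i] == value:
                return True
        return False
```

With this method in place, both `value in my_list` and `value not in my_list` work. Returning to slicing, the `Slice`-accepting method looks like this: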
```mojo
struct MySeq[type: Copyable & Movable]:
    # Return a new MySeq with a subset of elements
    fn __getitem__(self, span: Slice) -> Self:
        ...
```

A `Slice` contains three fields:

- `start` (`Optional[Int]`): The starting index of the slice
- `end` (`Optional[Int]`): The ending index of the slice
- `step` (`Optional[Int]`): The step increment value of the slice.

Because the start, end, and step values are all optional when using slice syntax, they are represented as `Optional[Int]` values in the `Slice`. And if present, the index values might be negative, representing a relative position from the end of the sequence.

As a convenience, `Slice` provides an `indices()` method that accepts a `length` value and returns a 3-tuple of "normalized" start, end, and step values for the given length, all represented as non-negative values. You can then use these normalized values to determine the corresponding elements of your collection being referenced.

```mojo
struct MySeq[type: Copyable & Movable]:
    var size: Int

    # Return a new MySeq with a subset of elements
    fn __getitem__(self, span: Slice) -> Self:
        var start: Int
        var end: Int
        var step: Int
        start, end, step = span.indices(self.size)
        ...
```

## An example of implementing operators for a custom type

As an example of implementing operators for a custom Mojo type, let's create a `Complex` struct to represent a single complex number, with both the real and imaginary components stored as `Float64` values. We'll implement most of the arithmetic operators, the associated in-place assignment operators, the equality comparison operators, and a few additional convenience methods to support operations like printing complex values. We'll also allow mixing `Complex` and `Float64` values in arithmetic expressions to produce a `Complex` result.

This example builds our `Complex` struct incrementally. You can also find the [complete example in the public GitHub repo](https://github.com/modular/modular/tree/main/examples/mojo/operators).

:::note

Note that the Mojo standard library implements a parameterized [`ComplexSIMD`](/mojo/stdlib/complex/complex/ComplexSIMD) struct that provides support for a basic set of arithmetic operators. However, our `Complex` type will not be based on the `ComplexSIMD` struct or be compatible with it.

:::

### Implement lifecycle methods

Our `Complex` struct is an example of a simple value type consisting of trivial numeric fields and requiring no special constructor or destructor behaviors. This means that we can take advantage of Mojo's [`@value`](/mojo/manual/decorators/value) decorator, which is described in [Simple value types](/mojo/manual/lifecycle/life#value-decorator), to automatically implement a member-wise initializer (a constructor with arguments for each field), a copy constructor, a move constructor, and a destructor.

```mojo
@value
struct Complex():
    var re: Float64
    var im: Float64
```

This definition is enough for us to create `Complex` instances and access their real and imaginary fields.

```mojo
c1 = Complex(-1.2, 6.5)
print(StaticString("c1: Real: {}; Imaginary: {}").format(c1.re, c1.im))
```

```output
c1: Real: -1.2; Imaginary: 6.5
```

As a convenience, let's add an explicit constructor to handle the case of creating a `Complex` instance with an imaginary component of 0.

```mojo
@value
struct Complex():
    var re: Float64
    var im: Float64

    fn __init__(out self, re: Float64, im: Float64 = 0.0):
        self.re = re
        self.im = im
```

Now we can create a `Complex` instance and provide just a real component.
```mojo
c2 = Complex(3.14159)
print(StaticString("c2: Real: {}; Imaginary: {}").format(c2.re, c2.im))
```

```output
c2: Real: 3.1415899999999999; Imaginary: 0.0
```

### Implement the `Writable` and `Stringable` traits

To make it simpler to print `Complex` values, let's implement the [`Writable`](/mojo/stdlib/utils/write/Writable) trait. While we're at it, let's also implement the [`Stringable`](/mojo/stdlib/builtin/str/Stringable) trait so that we can use the `String()` constructor to generate a `String` representation of a `Complex` value. You can find out more about these traits and their associated methods in [The `Stringable`, `Representable`, and `Writable` traits](/mojo/manual/traits#the-stringable-representable-and-writable-traits).

```mojo
@value
struct Complex(
    Writable,
    Stringable,
):
    # ...

    fn __str__(self) -> String:
        return String.write(self)

    fn write_to[W: Writer](self, mut writer: W):
        writer.write("(", self.re)
        if self.im < 0:
            writer.write(" - ", -self.im, "i)")
        else:
            writer.write(" + ", self.im, "i)")
```

### Implement indexing

Let's also support getting and setting the real and imaginary components by position, with index 0 referring to the real component and index 1 referring to the imaginary component. Because these methods raise an error for an out-of-bounds index, we define them with `def`.

```mojo
    # ...

    def __getitem__(self, idx: Int) -> Float64:
        if idx == 0:
            return self.re
        elif idx == 1:
            return self.im
        else:
            raise "index out of bounds"

    def __setitem__(mut self, idx: Int, value: Float64) -> None:
        if idx == 0:
            self.re = value
        elif idx == 1:
            self.im = value
        else:
            raise "index out of bounds"
```

Now let's try getting and setting the real and imaginary components of a `Complex` value using indexing.

```mojo
c2 = Complex(3.14159)
print(StaticString("c2[0]: {}; c2[1]: {}").format(c2[0], c2[1]))
c2[0] = 2.71828
c2[1] = 42
print("c2[0] = 2.71828; c2[1] = 42; c2:", c2)
```

```output
c2[0]: 3.1415899999999999; c2[1]: 0.0
c2[0] = 2.71828; c2[1] = 42; c2: (2.71828 + 42.0i)
```

### Implement arithmetic operators

Now let's implement the dunder methods that allow us to perform arithmetic operations on `Complex` values. (Refer to the [Wikipedia page](https://en.wikipedia.org/wiki/Complex_number) on complex numbers for a more in-depth explanation of the formulas for these operators.)

#### Implement basic operators for `Complex` values

The unary `+` operator simply returns the original value, whereas the unary `-` operator returns a new `Complex` value with the real and imaginary components negated.

```mojo
    # ...

    def __pos__(self) -> Self:
        return self

    def __neg__(self) -> Self:
        return Self(-self.re, -self.im)
```

Let's test these out by printing the result of applying each operator.

```mojo
c1 = Complex(-1.2, 6.5)
print("+c1:", +c1)
print("-c1:", -c1)
```

```output
+c1: (-1.2 + 6.5i)
-c1: (1.2 - 6.5i)
```

Next we'll implement the basic binary operators: `+`, `-`, `*`, and `/`. Dividing complex numbers is a bit tricky, so we'll also define a helper method called `norm()` to calculate the [Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm_of_complex_numbers) of a `Complex` instance, which can also be useful for other types of analysis with complex numbers. For all of these dunder methods, the left-hand side operand is `self` and the right-hand side operand is passed as an argument. We return a new `Complex` value representing the result.

```mojo
from math import sqrt

    # ...
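    # The methods below follow the standard complex-arithmetic identities:
    #   (a + bi) + (c + di) = (a + c) + (b + d)i
    #   (a + bi) - (c + di) = (a - c) + (b - d)i
    #   (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
    #   (a + bi) / (c + di) = ((ac + bd) + (bc - ad)i) / (c^2 + d^2)
    # where c^2 + d^2 is the squared norm of the divisor.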
    def __add__(self, rhs: Self) -> Self:
        return Self(self.re + rhs.re, self.im + rhs.im)

    def __sub__(self, rhs: Self) -> Self:
        return Self(self.re - rhs.re, self.im - rhs.im)

    def __mul__(self, rhs: Self) -> Self:
        return Self(
            self.re * rhs.re - self.im * rhs.im,
            self.re * rhs.im + self.im * rhs.re
        )

    def __truediv__(self, rhs: Self) -> Self:
        denom = rhs.squared_norm()
        return Self(
            (self.re * rhs.re + self.im * rhs.im) / denom,
            (self.im * rhs.re - self.re * rhs.im) / denom
        )

    def squared_norm(self) -> Float64:
        return self.re * self.re + self.im * self.im

    def norm(self) -> Float64:
        return sqrt(self.squared_norm())
```

Now we can try them out.

```mojo
c1 = Complex(-1.2, 6.5)
c3 = Complex(3.14159, -2.71828)
print("c1 + c3 =", c1 + c3)
print("c1 - c3 =", c1 - c3)
print("c1 * c3 =", c1 * c3)
print("c1 / c3 =", c1 / c3)
```

```output
c1 + c3 = (1.9415899999999999 + 3.78172i)
c1 - c3 = (-4.3415900000000001 + 9.21828i)
c1 * c3 = (13.898912000000001 + 23.682270999999997i)
c1 / c3 = (-1.2422030701265261 + 0.99419218883955773i)
```

#### Implement overloaded arithmetic operators for `Float64` values

Our initial set of binary arithmetic operators works fine if both operands are `Complex` instances. But if we have a `Float64` value representing just a real value, we'd first need to use it to create a `Complex` value before we could add, subtract, multiply, or divide it with another `Complex` value. If we think that this will be a common use case, it makes sense to overload our arithmetic methods to accept a `Float64` as the second operand.

For the case where we have `complex1 + float1`, we can just create an overloaded definition of `__add__()`. But what about the case of `float1 + complex1`? By default, when Mojo encounters a `+` operator it tries to invoke the `__add__()` method of the left-hand operand, but the built-in `Float64` type doesn't implement support for addition with a `Complex` value. This is an example where we need to implement the `__radd__()` method on the `Complex` type. When Mojo can't find an `__add__(self, rhs: Complex) -> Complex` method defined on `Float64`, it uses the `__radd__(self, lhs: Float64) -> Complex` method defined on `Complex`. So we can support arithmetic operations on `Complex` and `Float64` values by implementing the following eight methods.

```mojo
    # ...

    def __add__(self, rhs: Float64) -> Self:
        return Self(self.re + rhs, self.im)

    def __radd__(self, lhs: Float64) -> Self:
        return Self(self.re + lhs, self.im)

    def __sub__(self, rhs: Float64) -> Self:
        return Self(self.re - rhs, self.im)

    def __rsub__(self, lhs: Float64) -> Self:
        return Self(lhs - self.re, -self.im)

    def __mul__(self, rhs: Float64) -> Self:
        return Self(self.re * rhs, self.im * rhs)

    def __rmul__(self, lhs: Float64) -> Self:
        return Self(lhs * self.re, lhs * self.im)

    def __truediv__(self, rhs: Float64) -> Self:
        return Self(self.re / rhs, self.im / rhs)

    def __rtruediv__(self, lhs: Float64) -> Self:
        denom = self.squared_norm()
        return Self(
            (lhs * self.re) / denom,
            (-lhs * self.im) / denom
        )
```

Let's see them in action.
```mojo
c1 = Complex(-1.2, 6.5)
f1 = 2.5
print("c1 + f1 =", c1 + f1)
print("f1 + c1 =", f1 + c1)
print("c1 - f1 =", c1 - f1)
print("f1 - c1 =", f1 - c1)
print("c1 * f1 =", c1 * f1)
print("f1 * c1 =", f1 * c1)
print("c1 / f1 =", c1 / f1)
print("f1 / c1 =", f1 / c1)
```

```output
c1 + f1 = (1.3 + 6.5i)
f1 + c1 = (1.3 + 6.5i)
c1 - f1 = (-3.7000000000000002 + 6.5i)
f1 - c1 = (3.7000000000000002 - 6.5i)
c1 * f1 = (-3.0 + 16.25i)
f1 * c1 = (-3.0 + 16.25i)
c1 / f1 = (-0.47999999999999998 + 2.6000000000000001i)
f1 / c1 = (-0.068665598535133904 - 0.37193865873197529i)
```

#### Implement in-place assignment operators

Now let's implement support for the in-place assignment operators: `+=`, `-=`, `*=`, and `/=`. These modify the original value, so we need to mark `self` as a `mut` argument and update the `re` and `im` fields instead of returning a new `Complex` instance. And once again, we'll overload the definitions to support both a `Complex` and a `Float64` operand.

```mojo
    # ...

    def __iadd__(mut self, rhs: Self) -> None:
        self.re += rhs.re
        self.im += rhs.im

    def __iadd__(mut self, rhs: Float64) -> None:
        self.re += rhs

    def __isub__(mut self, rhs: Self) -> None:
        self.re -= rhs.re
        self.im -= rhs.im

    def __isub__(mut self, rhs: Float64) -> None:
        self.re -= rhs

    def __imul__(mut self, rhs: Self) -> None:
        new_re = self.re * rhs.re - self.im * rhs.im
        new_im = self.re * rhs.im + self.im * rhs.re
        self.re = new_re
        self.im = new_im

    def __imul__(mut self, rhs: Float64) -> None:
        self.re *= rhs
        self.im *= rhs

    def __itruediv__(mut self, rhs: Self) -> None:
        denom = rhs.squared_norm()
        new_re = (self.re * rhs.re + self.im * rhs.im) / denom
        new_im = (self.im * rhs.re - self.re * rhs.im) / denom
        self.re = new_re
        self.im = new_im

    def __itruediv__(mut self, rhs: Float64) -> None:
        self.re /= rhs
        self.im /= rhs
```

And now to try them out.

```mojo
c4 = Complex(-1, -1)
print("c4 =", c4)
c4 += Complex(0.5, -0.5)
print("c4 += Complex(0.5, -0.5) =>", c4)
c4 += 2.75
print("c4 += 2.75 =>", c4)
c4 -= Complex(0.25, 1.5)
print("c4 -= Complex(0.25, 1.5) =>", c4)
c4 -= 3
print("c4 -= 3 =>", c4)
c4 *= Complex(-3.0, 2.0)
print("c4 *= Complex(-3.0, 2.0) =>", c4)
c4 *= 0.75
print("c4 *= 0.75 =>", c4)
c4 /= Complex(1.25, 2.0)
print("c4 /= Complex(1.25, 2.0) =>", c4)
c4 /= 2.0
print("c4 /= 2.0 =>", c4)
```

```output
c4 = (-1.0 - 1.0i)
c4 += Complex(0.5, -0.5) => (-0.5 - 1.5i)
c4 += 2.75 => (2.25 - 1.5i)
c4 -= Complex(0.25, 1.5) => (2.0 - 3.0i)
c4 -= 3 => (-1.0 - 3.0i)
c4 *= Complex(-3.0, 2.0) => (9.0 + 7.0i)
c4 *= 0.75 => (6.75 + 5.25i)
c4 /= Complex(1.25, 2.0) => (3.404494382022472 - 1.247191011235955i)
c4 /= 2.0 => (1.702247191011236 - 0.6235955056179775i)
```

### Implement equality operators

The field of complex numbers is not an ordered field, so it doesn't make sense for us to implement the `Comparable` trait and the `>`, `>=`, `<`, and `<=` comparison operators. However, we can implement the `EqualityComparable` trait and the `==` and `!=` operators, which compare the corresponding real and imaginary components of the two values. (Note that two complex values are unequal if *either* component differs.)

```mojo
@value
struct Complex(
    EqualityComparable,
    Writable,
    Stringable,
):
    # ...

    fn __eq__(self, other: Self) -> Bool:
        return self.re == other.re and self.im == other.im

    fn __ne__(self, other: Self) -> Bool:
        return self.re != other.re or self.im != other.im
```

:::note

The `EqualityComparable` trait doesn't allow the `__eq__()` and `__ne__()` methods to raise errors. Because defining a method with `def` implies that it can raise an error, we instead have to define these methods with `fn`. See [Functions](/mojo/manual/functions) for more information on the differences between defining functions with `def` and `fn`.

:::

And now to try them out.
```mojo
c1 = Complex(-1.2, 6.5)
c3 = Complex(3.14159, -2.71828)
c5 = Complex(-1.2, 6.5)

if c1 == c5:
    print("c1 is equal to c5")
else:
    print("c1 is not equal to c5")

if c1 != c3:
    print("c1 is not equal to c3")
else:
    print("c1 is equal to c3")
```

```output
c1 is equal to c5
c1 is not equal to c3
```

---

## ops

Implements operations used when staging a graph.

This module provides operations for building computational graphs in MAX. These operations create, transform, and manipulate tensor values within the graph.

You can also use functions in [Graph](/max/api/python/graph/Graph) to add constant values to your graph with operations like [constant()](/max/api/python/graph/ops#max.graph.ops.constant).

The [TensorValue](/max/api/python/graph/TensorValue/) type (returned by most operations) implements various dunder methods to support operations between TensorValues, such as + for addition, \* for multiplication, and @ for matrix multiplication. It also provides convenience methods like [reshape()](/max/api/python/graph/TensorValue/#max.graph.TensorValue.reshape) and [flatten()](/max/api/python/graph/TensorValue/#max.graph.TensorValue.flatten).

## Casting

### `broadcast_to()` {#max.graph.ops.broadcast_to}

> max.graph.ops.broadcast\_to(x, shape, out\_dims=None)

Broadcasts a symbolic tensor.

Broadcasts the input tensor to the specified shape. Dimensions in the input must be one or match the target dimension.

**Parameters:**

* **x** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The input symbolic tensor to broadcast. This tensor may not contain any dynamic dimensions.
* **shape** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – The new shape as a list of dimensions. Dynamic dimensions are not allowed.
* **out\_dims** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` `|` `None` ) – Output dims used only for tensor-valued shape.

**Returns:** A symbolic tensor with the same elements as the original tensor, but in a new shape. Its symbolic shape is the same as `shape`.

**Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – if a tensor-valued shape is passed without out\_dims.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `cast()` {#max.graph.ops.cast}

> max.graph.ops.cast(x, dtype)

Casts a symbolic tensor to a different data type.

**Parameters:**

* **x** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The input tensor to cast.
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The target dtype to which the tensor is cast.

**Returns:** A new symbolic tensor with the same shape as the input and the specified dtype.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `rebind()` {#max.graph.ops.rebind}

> max.graph.ops.rebind(x, shape, message='')

Rebinds a symbolic tensor to a specified set of dimensions.
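For example, a typical use is pinning a symbolic dimension to a fixed size before an operation that requires it. The following is a minimal sketch under assumed names: it presumes a surrounding graph context in which `x` has the symbolic shape `["batch", "seq_len"]`.

```python
# Assert that the dynamic "seq_len" dimension is actually 128, so that
# later operations can rely on the fixed size:
x_fixed = ops.rebind(x, shape=["batch", 128], message="expected seq_len == 128")
```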
This does not mutate the symbolic tensor passed in, but instead adds a runtime assert that the input symbolic shape is equivalent to the given `shape`. For example, if the input tensor shape has dynamic/unknown sizes, this will assert fixed sizes that may be required for a subsequent operation.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to rebind.
* **shape** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – The symbolic shape to assert for `x`, as a list of [`Dim`](/max/api/python/graph/type/Dim) values.
* **message** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – The message printed if the rebind fails at runtime.

**Returns:** A symbolic tensor with the same elements and shape as the given tensor, but with the symbolic shape asserted to `shape`.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `reshape()` {#max.graph.ops.reshape}

> max.graph.ops.reshape(x, shape)

Reshapes a symbolic tensor.

The number and order of the elements in the tensor are unchanged. In other words, if you were to iterate over elements in the tensor by major dimension to minor dimension, the iteration order would stay the same.

If a value of -1 is present in the shape, that dimension becomes an automatically calculated dimension collecting all unspecified dimensions. Its length becomes the number of elements in the original tensor divided by the product of the other specified dimensions.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to reshape. This tensor may not contain any dynamic dimensions.
* **shape** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – The new shape as a list of dimensions. Dynamic dimensions are not allowed. A single dimension may be -1.
**Returns:** A symbolic tensor with the same elements as the original tensor, but in a new shape. Its symbolic shape is the same as `shape`.

**Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – if the number of elements in the input and target shapes don’t match.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `shape_to_tensor()` {#max.graph.ops.shape_to_tensor}

> max.graph.ops.shape\_to\_tensor(shape)

Converts a shape to a tensor.

This is useful for using a shape attribute in an op that expects a tensor value.

**Parameters:** **shape** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – the shape attribute of a tensor value.

**Returns:** The TensorValue containing the same value as shape.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### Example

```pycon
>>> x = ops.constant(np.zeros((1,)), DType.int64, device=DeviceRef.CPU())
>>> result = ops.stack([
...     x,
...     ops.shape_to_tensor(x.shape),
... ])
TensorValue(dtype=int64, shape=[StaticDim(dim=2), StaticDim(dim=1)])
```

### `squeeze()` {#max.graph.ops.squeeze}

> max.graph.ops.squeeze(x, axis)

Removes a size-1 dimension from a symbolic tensor.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to squeeze.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimension to remove from the input’s shape. If negative, this indexes from the end of the tensor. For example, `squeeze(v, -1)` squeezes the last dimension.

**Returns:** A symbolic tensor with the same number of elements as the input tensor, and whose rank is 1 less than the rank of the input tensor.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `transpose()` {#max.graph.ops.transpose}

> max.graph.ops.transpose(x, axis\_1, axis\_2)

Transposes two axes of a symbolic tensor.

For more information, see [`transpose()`](TensorValue.md#max.graph.TensorValue.transpose).

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to transpose.
* **axis\_1** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – One of the two axes to transpose. If negative, this indexes from the end of the tensor. For example, `transpose(v, -1, -2)` transposes the last two axes.
* **axis\_2** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The other axis to transpose. May also be negative to index from the end of the tensor.

**Returns:** A new symbolic tensor with the two specified axes transposed. It has the same elements and dtype, but the order of the elements is different according to the transposition.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `unsqueeze()` {#max.graph.ops.unsqueeze}

> max.graph.ops.unsqueeze(x, axis)

Inserts a size-1 dimension into a symbolic tensor.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to unsqueeze.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The index at which to insert a new dimension into the input’s shape. Elements at that index or higher are shifted back. If negative, it indexes relative to *1 plus* the rank of the tensor. For example, `unsqueeze(v, -1)` adds a new dimension at the end, and `unsqueeze(v, -2)` inserts the dimension immediately before the last dimension.

**Returns:** A symbolic tensor with the same number of elements as the input tensor, whose rank is 1 larger than the rank of the input tensor. The result’s shape at the `axis` dimension is a static dimension of size 1.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Complex

### `as_interleaved_complex()` {#max.graph.ops.as_interleaved_complex}

> max.graph.ops.as\_interleaved\_complex(x)

Reshapes the input symbolic tensor as complex from alternating (real, imag).

**Parameters:**

* **interleaved** – A symbolic tensor representing complex numbers as alternating pairs of (real, imag) real-valued numbers. Its last dimension must have an even size.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A symbolic tensor representing the complex-valued tensor, but with the values pulled out as complex numbers. The result has the same dimensions for all dimensions except the last dimension, which is halved, and then a final dimension of size 2 representing the complex value.
**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Constant

### `constant()` {#max.graph.ops.constant}

> max.graph.ops.constant(value, dtype, device)

Adds a node representing a constant operation.

The value of this constant will have the type TensorType with the same shape as value. If value is a scalar type, it will create a TensorType with 0 dimensions.

The constant will be loaded with the specified dtype. If the constant does not fit within the specified dtype, an error is raised.

Warning: Loading the constant could result in precision loss. For example, loading 16777217 as a float32 will result in 16777216.0.

**Parameters:**

* **value** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) ) – The constant’s value.
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The constant tensor’s element type.
* **device** (`DeviceRef` ) – The device the constant lives on.

**Returns:** A graph value containing the constant data as an attribute.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Convolution

### `conv2d()` {#max.graph.ops.conv2d}

> max.graph.ops.conv2d(x, filter, stride=(1, 1), dilation=(1, 1), padding=(0, 0, 0, 0), groups=1, bias=None)

Computes the 2-D convolution product of the input with the given filter, bias, strides, dilations, paddings, and groups.

The op supports 2-D convolution, with the following layout assumptions:

* input x has NHWC layout, i.e., (batch\_size, height, width, in\_channels)
* filter has layout RSCF, i.e., (height, width, in\_channels / num\_groups, out\_channels)
* bias has shape (out\_channels,)

The padding values are expected to take the form (pad\_dim1\_before, pad\_dim1\_after, pad\_dim2\_before, pad\_dim2\_after…) and represent padding 0’s before and after the indicated *spatial* dimensions in input. In 2-D convolution, dim1 here represents H and dim2 represents W. In Python-like syntax, padding a 2x3 spatial input with \[0, 1, 2, 1] would yield:

```python
input = [
    [1, 2, 3],
    [4, 5, 6]
]
## Shape is 2x3

padded_input = [
    [0, 0, 1, 2, 3, 0],
    [0, 0, 4, 5, 6, 0],
    [0, 0, 0, 0, 0, 0]
]
## Shape is 3x6
```

This op currently only supports strides and padding on the input.

**Parameters:**

* **input** – An NHWC input tensor to perform the convolution upon.
* **filter** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The convolution filter in RSCF layout: (height, width, in\_channels / num\_groups, out\_channels).
* **stride** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The stride of the convolution operation.
* **dilation** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The spacing between the kernel points.
* **padding** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The amount of padding applied to the input.
* **groups** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – When greater than 1, divides the convolution into multiple parallel convolutions. The number of input and output channels must both be divisible by the number of groups.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` )

**Returns:** A symbolic tensor value with the convolution applied.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `conv3d()` {#max.graph.ops.conv3d}

> max.graph.ops.conv3d(x, filter, stride=(1, 1, 1), dilation=(1, 1, 1), padding=(0, 0, 0, 0, 0, 0), groups=1, bias=None)

Computes the 3-D convolution product of the input with the given filter, strides, dilations, paddings, and groups.

The op supports 3-D convolution, with the following layout assumptions:

* input has NDHWC layout, i.e., (batch\_size, depth, height, width, in\_channels)
* filter has layout RSCF, i.e., (depth, height, width, in\_channels / num\_groups, out\_channels)

The padding values are expected to take the form (pad\_dim1\_before, pad\_dim1\_after, pad\_dim2\_before, pad\_dim2\_after…) and represent padding 0’s before and after the indicated *spatial* dimensions in input. In 3-D convolution, dim1 here represents D, dim2 represents H and dim3 represents W.
In Python-like syntax, padding a 2x3 spatial input with \[0, 1, 2, 1] would yield:

```python
input = [
    [1, 2, 3],
    [4, 5, 6]
]
## Shape is 2x3

padded_input = [
    [0, 0, 1, 2, 3, 0],
    [0, 0, 4, 5, 6, 0],
    [0, 0, 0, 0, 0, 0]
]
## Shape is 3x6
```

This op currently only supports strides and padding on the input.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – An NDHWC input tensor to perform the convolution upon.
* **filter** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The convolution filter in RSCF layout: (depth, height, width, in\_channels / num\_groups, out\_channels).
* **stride** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The stride of the convolution operation.
* **dilation** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The spacing between the kernel points.
* **padding** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The amount of padding applied to the input.
* **groups** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – When greater than 1, divides the convolution into multiple parallel convolutions. The number of input and output channels must both be divisible by the number of groups.
* **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` )

**Returns:** A symbolic tensor value with the convolution applied. Output shape = (batch\_size, depth, height, width, out\_channels).

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `conv2d_transpose()` {#max.graph.ops.conv2d_transpose}

> max.graph.ops.conv2d\_transpose(x, filter, stride=(1, 1), dilation=(1, 1), padding=(0, 0, 0, 0), output\_paddings=(0, 0), bias=None)

Computes the 2-D deconvolution of the input with the given filter, strides, dilations, paddings, and groups.

The op supports the transpose (gradient) of convolution, with the following layout assumptions: (note the out\_channel is w\.r.t. the original convolution)

* input x has NHWC layout, i.e., (batch\_size, height, width, in\_channels)
* filter has layout RSCF, i.e., (kernel\_height, kernel\_width, out\_channels, in\_channels)
* bias has shape (out\_channels,)

The padding values are expected to take the form \[\[0, 0], \[pad\_top, pad\_bottom], \[pad\_left, pad\_right], \[0, 0]].

This op effectively computes the gradient of a convolution with respect to its input (as if the original convolution operation had the same filter and hyperparameters as this op).

The padding values are expected to take the form (pad\_dim1\_before, pad\_dim1\_after, pad\_dim2\_before, pad\_dim2\_after…) and represent padding 0’s before and after the indicated *spatial* dimensions in input. In 2D ConvTranspose, dim1 here represents H\_out and dim2 represents W\_out. In Python-like syntax, padding a 2x4 spatial output with \[0, 1, 2, 1] would yield:

```python
output = [
    [1, 2, 3, 4],
    [5, 6, 7, 8]
]
## Shape is 2x4

padded_input = [
    [3],
]
## Shape is 1x1
```

**Parameters:**

* **input** – An NHWC input tensor to perform the convolution upon.
* **filter** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The convolution filter in RSCF layout: (height, width, out\_channels, in\_channels).
* **stride** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The stride of the sliding window for each dimension of input. If a single value is given it is replicated in the H and W dimension. By default the N and C dimensions are set to 0.
* **dilation** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The spacing between the kernel points.
* **padding** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The amount of padding applied to the input.
* **output\_paddings** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – This argument is meant to resolve the ambiguity of multiple potential output shapes when any stride is greater than 1. Basically, we’ll add output\_paddings\[i] number of zeros at the end of output’s ith axis. We only support output\_paddings = 0.
* **bias** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) – tensor of shape (out\_channels,)
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A symbolic tensor value with the convolution applied.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Control flow

### `cond()` {#max.graph.ops.cond}

> max.graph.ops.cond(pred, out\_types, then\_fn, else\_fn)

Conditionally execute one of two branches based on a boolean predicate.

Both branches must return the same number and types of values as specified in `out_types`. Buffer mutations in branches are tracked automatically through the chain mechanism.

Examples:

1. Basic conditional with return values:

   > ```python
   > def then_fn():
   >     return ops.constant(1, DType.int32, device=DeviceRef.CPU())
   >
   > def else_fn():
   >     return ops.constant(0, DType.int32, device=DeviceRef.CPU())
   >
   > result = ops.cond(
   >     pred,
   >     [TensorType(DType.int32, [], device=device)],
   >     then_fn,
   >     else_fn
   > )
   > ```

2. Conditional with buffer mutations:

   > ```python
   > def then_fn():
   >     ops.inplace_custom("increment", [buffer])
   >
   > def else_fn():
   >     ops.inplace_custom("decrement", [buffer])
   >
   > ops.cond(pred, None, then_fn, else_fn)
   > ```

**Parameters:**

* **pred** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – Boolean scalar tensor of type `DType.bool` determining branch execution.
* **out\_types** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`Type`](type.md#max.graph.type.Type) `]` `|` `None` ) – Expected output types for both branches. Use [`None`](https://docs.python.org/3/library/constants.html#None) for branches that don’t return values.
* **then\_fn** ([`Callable`](https://docs.python.org/3/library/typing.html#typing.Callable) ) – Callable executed when `pred` is True. Must return values matching `out_types` if `out_types` is not `None`.
* **else\_fn** ([`Callable`](https://docs.python.org/3/library/typing.html#typing.Callable) ) – Callable executed when `pred` is False. Must return values matching `out_types` if `out_types` is not `None`.

**Returns:** List of output values from the executed branch. Returns an empty list when `out_types` is [`None`](https://docs.python.org/3/library/constants.html#None).

**Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If branches return different numbers of results or result types don’t match `out_types`.

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*TensorValue*](TensorValue.md#max.graph.TensorValue)]

##### NOTE

Buffer operations in branches automatically update the global chain state to maintain mutation ordering constraints.

### `while_loop()` {#max.graph.ops.while_loop}

> max.graph.ops.while\_loop(initial\_values, predicate, body)

Execute a loop until the predicate evaluates to false.

Both the predicate and body functions must take in as arguments the same number and types of values as specified in the init\_args. The predicate function must return only a boolean scalar tensor of type `DType.bool`. The body function must return a list of values matching the types of init\_args.

The following example demonstrates a basic while loop with a single argument:

```python
from max.graph import DeviceRef, Graph, ops
from max.dtype import DType

with Graph("while_loop_example") as g:
    x = ops.constant(0, dtype=DType.int32, device=DeviceRef.CPU())

    def pred(x):
        return x < 10

    def body(x):
        return x + 1

    results = ops.while_loop(x, pred, body)
```

**Parameters:**

* **initial\_values** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`Value`](Value.md#max.graph.Value) `]` `|` [`Value`](Value.md#max.graph.Value) ) – Initial values for loop arguments. Must be non-empty.
* **predicate** ([`Callable`](https://docs.python.org/3/library/typing.html#typing.Callable) `[` `[` `...` `]` `,` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `]` ) – Callable that takes loop arguments and returns a boolean scalar tensor of type `DType.bool` determining loop continuation.
* **body** ([`Callable`](https://docs.python.org/3/library/typing.html#typing.Callable) `[` `[` `...` `]` `,` [`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`Value`](Value.md#max.graph.Value) `]` `]` ) – Callable that takes loop arguments and returns updated values matching the types of init\_args.

**Returns:** List of output values from the final loop iteration.

**Raises:**

* [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If init\_args is empty.
* [**NotImplementedError**](https://docs.python.org/3/library/exceptions.html#NotImplementedError) – If any init\_arg is a `BufferValue`.

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*TensorValue*](TensorValue.md#max.graph.TensorValue)]

##### NOTE

Buffer operations are currently not supported.

## Custom

A custom operation (op) is a user-defined kernel written in [Mojo](/mojo/manual/) that is registered and executed within the computation graph. It allows you to extend the graph’s capabilities by implementing your own specialized operations.

For example, you might write an `add_one_custom` function in Mojo that adds 1 to each element of a matrix. Then you’d call the operation by its string name in the [`max.graph.Graph`](Graph.md#max.graph.Graph):

```python
def create_graph(rows: int, columns: int, dtype: DType) -> Graph:
    """Configure a graph with a custom operation."""
    graph = Graph(
        "addition",
        forward=lambda x: ops.custom(
            name="add_one_custom",
            values=[x],
            out_types=[TensorType(dtype=x.dtype, shape=x.tensor.shape)],
        )[0].tensor,
        input_types=[
            TensorType(dtype, shape=[rows, columns]),
        ],
    )
    return graph
```

Custom ops also support parametrization on int, str, and dtype. This means you can define custom parametric Mojo types and then use those types as inputs to custom ops staged in the graph API. For example, given the following Counter Mojo type:

```mojo
struct Counter[stride: Int](Movable):
    var a: Int
    var b: Int

    fn __init__(out self):
        self.a = 0
        self.b = 0

    fn __init__(out self, a: Int, b: Int):
        self.a = a
        self.b = b

    fn __moveinit__(out self, owned other: Self):
        self.a = other.a
        self.b = other.b

    fn bump(mut self):
        self.a += Self.stride
        self.b += self.a
```

The following [`inplace_custom()`](#max.graph.ops.inplace_custom) call stages an op that bumps the parametric `Counter` type. Notice that we’re using `_OpaqueType` here, which is a Python-based graph type that represents a Mojo value (from `max.graph.type`), but it’s currently an internal API and subject to change.

```python
counter_type = _OpaqueType("Counter")

## ... create counter object.

## Stage a graph that bumps the counter, parametrized on stride.
bumper_graph = Graph(
    "bumper",
    forward=lambda x: ops.inplace_custom(
        "bump_counter",
        values=[x],
        out_types=[],
        parameters={"stride": 2},
    ),
    input_types=[counter_type],
)
```

### `custom()` {#max.graph.ops.custom}

> max.graph.ops.custom(name, values, out\_types, parameters=None, device=None)

Creates a node to execute a custom graph operation in the graph.

The custom op should be registered by annotating a function with the [@compiler.register](/max/api/mojo-decorators/compiler-register/) decorator.
**Parameters:**

* **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – The op name provided to `@compiler.register`.
* **values** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`Value`](Value.md#max.graph.Value) `]` ) – The op function's arguments.
* **out\_types** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`Type`](type.md#max.graph.type.Type) `]` ) – The list of the op function's return types.
* **parameters** ([`Mapping`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Mapping) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`bool`](https://docs.python.org/3/library/functions.html#bool) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`DType`](../dtype.md#max.dtype.DType) `]` `|` `None` ) – Dictionary of extra parameters expected by the kernel.
* **device** (`DeviceRef` `|` `None` ) – Device that the op is assigned to. This becomes a target parameter to the kernel.

**Returns:** Symbolic values representing the outputs of the op in the graph. These correspond 1:1 with the types passed as `out_types`.

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Value*](Value.md#max.graph.Value)]

### `inplace_custom()` {#max.graph.ops.inplace_custom}

> max.graph.ops.inplace\_custom(name, values, out\_types=None, parameters=None, device=None)

Creates a node to execute an in-place custom graph operation in the graph.

The custom op should be registered by annotating a function with the [@compiler.register](/max/api/mojo-decorators/compiler-register/) decorator.

**Parameters:**

* **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – The op name provided to `@compiler.register`.
* **values** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`Value`](Value.md#max.graph.Value) `]` ) – The op function's arguments.
* **parameters** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`bool`](https://docs.python.org/3/library/functions.html#bool) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`DType`](../dtype.md#max.dtype.DType) `]` `|` `None` ) – Dictionary of extra parameters expected by the kernel.
* **device** (`DeviceRef` `|` `None` ) – Device that the op is assigned to. This becomes a target parameter to the kernel.
* **out\_types** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`Type`](type.md#max.graph.type.Type) `]` `|` `None` )

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Value*](Value.md#max.graph.Value)]

## Debug

Operations used to help debug your graph.

### `print()` {#max.graph.ops.print}

> max.graph.ops.print(value, label='debug\_tensor')

Prints the value of a tensor or a string during graph execution.

This function outputs the current value of a tensor and is primarily used for debugging purposes within the context of the MAX Engine and its graph execution framework. It is particularly useful for verifying that the intermediate results of your computations are as expected.
By printing the tensor values, you can visualize the data flowing through the graph, which helps you understand how the operations transform the data. Assigning a label also makes it easier to identify which tensor's value is being printed, especially when there are multiple print statements in a complex graph.

```python
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops

def add_tensors():
    input_type = TensorType(dtype=DType.float32, shape=(1,), device=DeviceRef.CPU())
    with Graph(
        "simple_add_graph", input_types=(input_type, input_type)
    ) as graph:
        lhs, rhs = graph.inputs
        out = ops.add(lhs, rhs)
        ops.print(out, label="addition_output")  # Pass the output tensor here
        graph.output(out)
        print("final graph:", graph)
```

**Parameters:**

* **value** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The value to print. Can be either a string or a TensorValue.
* **label** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – A label to identify the printed value. Defaults to `debug_tensor`.

## Distributed

### `allgather()` {#max.graph.ops.allgather}

> max.graph.ops.allgather(inputs, dim=0)

Collective allgather operation.

This op is a collective op which takes in tensors from different devices and outputs tensors on different devices. In particular, this operation gathers the inputs across different devices and concatenates them along dimension `dim` (the 0th dimension by default). The result is then broadcast back to the same devices that the inputs came from.

**Parameters:**

* **inputs** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `]` ) – The input tensors to gather.
* **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Dimension to concatenate the input tensors. Defaults to 0.

**Returns:** An iterable of outputs which all hold the gathered output. Each output is a rank-1 array.

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*TensorValue*](TensorValue.md#max.graph.TensorValue)]

### `sum()` {#max.graph.ops.allreduce.sum}

> max.graph.ops.allreduce.sum(inputs, signal\_buffers)

Collective allreduce summation operation.

This op is a collective op which takes in tensors from different devices and outputs tensors on different devices. In particular, this operation gathers the inputs across different devices and reduces them via a summation operation. The result is then broadcast back to the same devices that the inputs came from. This op performs device-to-device transfers, using the provided signal buffers for synchronization.

**Parameters:**

* **inputs** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `]` ) – The input tensors to reduce.
* **signal\_buffers** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`BufferValue`](BufferValue.md#max.graph.BufferValue) `]` ) – Device buffer values used for synchronization.

**Returns:** An iterable of outputs which all hold the reduction output.

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*TensorValue*](TensorValue.md#max.graph.TensorValue)]

## Elementwise

An elementwise operation performs the same calculation on each element of an input tensor.
These operations take tensors of compatible shapes and apply the specified operation to each element pair. For example, the following demonstrates how to add two tensors using the [`add()`](#max.graph.ops.add) function:

```python
import numpy as np

from max import engine
from max.dtype import DType
from max.graph import Graph, TensorType, ops

def main():
    input_type = TensorType(dtype=DType.float32, shape=(2,))
    with Graph("simple_add_graph", input_types=(input_type, input_type)) as graph:
        x = graph.inputs[0]  # First addend
        y = graph.inputs[1]  # Second addend
        out = ops.add(x, y)
        graph.output(out)

    session = engine.InferenceSession()
    model = session.load(graph)

    input_0 = np.array([10.0, 8.0], dtype=np.float32)
    input_1 = np.array([2.0, 4.0], dtype=np.float32)
    ret = model.execute(input_0, input_1)
    print("\nAddition computation:")
    print("Result ", ret["output0"])

if __name__ == "__main__":
    main()
```

### `abs()` {#max.graph.ops.abs}

> max.graph.ops.abs(x)

Computes the elementwise absolute value of a symbolic tensor.

Creates a new op node to compute the elementwise absolute value of a symbolic tensor and adds it to the graph, returning the symbolic result.

The following demonstrates how to compute the absolute value using the [`abs()`](#max.graph.ops.abs) function:

```python
def abs_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2,), device=DeviceRef.CPU())
    with Graph("abs_graph", input_types=(input_type,)) as graph:
        x = graph.inputs[0]
        out = ops.abs(x)
        graph.output(out)
```

**Parameters:**

* **value** – The symbolic tensor to use as the input to the absolute value computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the absolute value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `add()` {#max.graph.ops.add}

> max.graph.ops.add(lhs, rhs)

Adds two symbolic tensors.

Creates a new op node to compute the addition of two symbol tensor values and adds it to the graph, returning the symbolic result. The following shows an example of the add() function with two inputs:

```python
def add_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2,), device=DeviceRef.CPU())
    with Graph("add_graph", input_types=(input_type, input_type)) as graph:
        x = graph.inputs[0]
        y = graph.inputs[1]
        out = ops.add(x, y)
        graph.output(out)
```

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.
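Because of these rules, one operand can be a plain Python scalar; here's a minimal sketch (the graph name is illustrative):

```python
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops

def add_scalar_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2,), device=DeviceRef.CPU())
    with Graph("add_scalar_graph", input_types=(input_type,)) as graph:
        x = graph.inputs[0]
        # The Python float 1.0 is promoted and broadcast to x's dtype and shape.
        out = ops.add(x, 1.0)
        graph.output(out)
```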
**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the addition.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the addition.
* **location** – An optional location for a more specific error message.

**Returns:** A symbolic tensor value representing the output of the addition. The result will have:

* the same dtype as the type-promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `cos()` {#max.graph.ops.cos}

> max.graph.ops.cos(x)

Computes the elementwise cosine of a symbolic tensor.

Creates a new op node to compute the elementwise cosine of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the cos computation. If it's not a floating-point DType, an exception will be raised.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the cosine value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `div()` {#max.graph.ops.div}

> max.graph.ops.div(lhs, rhs)

Divides two symbolic tensors.

Creates a new op node to compute the division of two symbol tensor values and adds it to the graph, returning the symbolic result.

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.

**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the division.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the division.
* **location** – An optional location for a more specific error message.

**Returns:** A symbolic tensor value representing the output of the division. The result will have:

* the same dtype as the type-promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `equal()` {#max.graph.ops.equal}

> max.graph.ops.equal(lhs, rhs)

Computes the elementwise equality comparison between two symbolic tensors.

Creates a new op node to compute the equality comparison of two symbol tensor values and adds it to the graph, returning the symbolic result.

```python
def equal_graph():
    input_type = TensorType(dtype=DType.float32, shape=(3,), device=DeviceRef.CPU())
    with Graph("equal_graph", input_types=(input_type, input_type)) as graph:
        x = graph.inputs[0]  # First input
        y = graph.inputs[1]  # Second input
        out = ops.equal(x, y)
        graph.output(out)
```

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.
**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the equality comparison.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the equality comparison.

**Returns:** A symbolic tensor value representing the output of the equality comparison. The result will have:

* the same dtype as the type promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `erf()` {#max.graph.ops.erf}

> max.graph.ops.erf(x)

Computes the elementwise error function of a symbolic tensor.

Creates a new op node to compute the elementwise error function of a symbolic tensor and adds it to the graph, returning the symbolic result.

The error function is defined as:

$$ \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} \, dt $$

which gives the probability that a normally distributed random variable with mean 0 and standard deviation $1/\sqrt{2}$ falls in the range $[-x, x]$.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the error function computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the error function value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `exp()` {#max.graph.ops.exp}

> max.graph.ops.exp(x)

Computes the elementwise exp function of a symbolic tensor.

Creates a new op node to compute the elementwise exp function of a symbolic tensor and adds it to the graph, returning the symbolic result.

`exp` is defined as `exp(x) = e^x`, where `e` is Euler's number.
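Following the pattern of the other examples in this section, here's a minimal sketch of staging `exp` in a graph (the graph name and shape are illustrative):

```python
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops

def exp_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2,), device=DeviceRef.CPU())
    with Graph("exp_graph", input_types=(input_type,)) as graph:
        x = graph.inputs[0]
        out = ops.exp(x)  # elementwise e^x
        graph.output(out)
```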
**Parameters:**

* **value** – The symbolic tensor to use as the input to the exp function computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the exp value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `floor()` {#max.graph.ops.floor}

> max.graph.ops.floor(x)

Computes the elementwise floor of a symbolic tensor.

Creates a new op node to compute the elementwise floor of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the floor computation. If it's not a floating-point DType, an exception will be raised.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the floor value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `gelu()` {#max.graph.ops.gelu}

> max.graph.ops.gelu(x, approximate='none')

Computes the elementwise gelu of a symbolic tensor.

Creates a new op node to compute the elementwise gelu of a symbolic tensor and adds it to the graph, returning the symbolic result.

For `approximate == "none"`, the exact gelu function is computed.

For `approximate == "tanh"`, the approximation:

$$ \text{gelu}(x) = 0.5 \cdot x \cdot \left(1 + \tanh\left(0.7978845608028654 \cdot \left(x + 0.044715 \cdot x^{3}\right)\right)\right) $$

is used.

For `approximate == "quick"`, the approximation:

$$ \text{gelu}(x) = \operatorname{sigmoid}(1.702 \cdot x) \cdot x $$

is used.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the gelu computation.
* **x** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) )
* **approximate** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) )

**Returns:** A new symbolic tensor value representing the output of the gelu value computation.

**Raises:**

* **Error** – If the symbol doesn't represent a tensor value.
* [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the approximation method is invalid.

### `greater()` {#max.graph.ops.greater}

> max.graph.ops.greater(lhs, rhs)

Computes the elementwise greater than comparison between two symbolic tensors.
Creates a new op node to compute the greater than comparison of two symbol tensor values and adds it to the graph, returning the symbolic result.

```python
def greater_than_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2,), device=DeviceRef.CPU())
    with Graph("greater_graph", input_types=(input_type, input_type)) as graph:
        x = graph.inputs[0]  # Left hand side
        y = graph.inputs[1]  # Right hand side
        out = ops.greater(x, y)
        graph.output(out)
```

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.

**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the greater than comparison.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the greater than comparison.

**Returns:** A symbolic tensor value representing the output of the greater than comparison. The result will have:

* the same dtype as the type-promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `greater_equal()` {#max.graph.ops.greater_equal}

> max.graph.ops.greater\_equal(lhs, rhs)

Computes the elementwise greater-or-equal comparison between two symbolic tensors.

Creates a new op node to compute the greater-or-equal comparison of two symbol tensor values and adds it to the graph, returning the symbolic result.

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.
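A minimal sketch in the style of the `greater()` example above (the graph name and shape are illustrative):

```python
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops

def greater_equal_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2,), device=DeviceRef.CPU())
    with Graph("greater_equal_graph", input_types=(input_type, input_type)) as graph:
        x = graph.inputs[0]  # Left hand side
        y = graph.inputs[1]  # Right hand side
        out = ops.greater_equal(x, y)  # elementwise x >= y, boolean result
        graph.output(out)
```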
**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the greater-or-equal comparison.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the greater-or-equal comparison.

**Returns:** A symbolic tensor value representing the output of the greater-or-equal comparison. The result will have:

* the same dtype as the type-promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `is_inf()` {#max.graph.ops.is_inf}

> max.graph.ops.is\_inf(x)

Computes the elementwise is\_inf of a symbolic tensor.

Creates a new op node to compute the elementwise is\_inf of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the is\_inf computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** The result will have:

* element type `bool`, true if the element at a given position is plus or minus infinity, false otherwise
* the same shape as the input value.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

### `is_nan()` {#max.graph.ops.is_nan}

> max.graph.ops.is\_nan(x)

Computes the elementwise is\_nan of a symbolic tensor.

Creates a new op node to compute the elementwise is\_nan of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the is\_nan computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** The result will have:

* element type `bool`, true if the element at a given position is NaN, false otherwise
* the same shape as the input value.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

### `log()` {#max.graph.ops.log}

> max.graph.ops.log(x)

Computes the elementwise natural logarithm of a symbolic tensor.

Creates a new op node to compute the elementwise natural logarithm of a symbolic tensor and adds it to the graph, returning the symbolic result.

The natural logarithm function `log` is defined as the inverse of the exponential function `exp()`. In other words, it computes the value `y` in the equation `x = e^y` where `e` is Euler's number.

`log(x)` is undefined for `x <= 0`.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the natural logarithm computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the natural logarithm value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `log1p()` {#max.graph.ops.log1p}

> max.graph.ops.log1p(x)

Computes the elementwise logarithm of 1 plus a symbolic tensor.

Creates a new op node to compute the elementwise log1p of a symbolic tensor and adds it to the graph, returning the symbolic result.

The `log1p` function is defined as `log1p(x) = log(1 + x)`, where `log()` is the natural logarithm. Using `log1p(x)` rather than computing `log(1 + x)` can give more numerically precise results.

`log1p(x)` is undefined for `x <= -1`.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the log1p computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the log1p value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `logical_not()` {#max.graph.ops.logical_not}

> max.graph.ops.logical\_not(x)

Computes the elementwise logical\_not of a symbolic tensor.

Creates a new op node to compute the elementwise logical\_not of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the logical\_not computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** The result will have:

* element type `bool`, true if the corresponding input element is false, false otherwise
* the same shape as the input value.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

### `logsoftmax()` {#max.graph.ops.logsoftmax}

> max.graph.ops.logsoftmax(x)

Computes the elementwise logsoftmax of a symbolic tensor.

Creates a new op node to compute the elementwise logsoftmax of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the logsoftmax computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the logsoftmax value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `max()` {#max.graph.ops.max}

> max.graph.ops.max(x, y=None, /, axis=None)

Overload for ops.elementwise.max and ops.reduction.max.

* If two tensors are provided, axis is ignored and returns an elementwise maximum.
* If one tensor is provided, compute ops.reduction.max on the tensor and axis.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **y** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` )
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `min()` {#max.graph.ops.min}

> max.graph.ops.min(x, y=None, /, axis=None)

Overload for ops.elementwise.min and ops.reduction.min.

* If two tensors are provided, axis is ignored and returns an elementwise minimum.
* If one tensor is provided, compute ops.reduction.min on the tensor and axis.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **y** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` )
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `mod()` {#max.graph.ops.mod}

> max.graph.ops.mod(lhs, rhs)

Computes the elementwise modulus of two symbolic tensors.

Creates a new op node to compute the modulus of two symbol tensor values and adds it to the graph, returning the symbolic result.

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.

**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the modulus operation.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the modulus operation.

**Returns:** A symbolic tensor value representing the output of the modulus operation. The result will have:

* the same dtype as the type-promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `mul()` {#max.graph.ops.mul}

> max.graph.ops.mul(lhs, rhs)

Computes the elementwise multiplication of two symbolic tensors.

Creates a new op node to compute the multiplication of two symbol tensor values and adds it to the graph, returning the symbolic result.

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.

**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the multiplication.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the multiplication.

**Returns:** A symbolic tensor value representing the output of the multiplication. The result will have:

* the same dtype as the type-promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `negate()` {#max.graph.ops.negate}

> max.graph.ops.negate(x)

Computes the elementwise negation of a symbolic tensor.

Creates a new op node to compute the elementwise negation of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the negation computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the elementwise negation of the input. The result will have:

* the same dtype as the input value
* the same shape as the input value.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

### `not_equal()` {#max.graph.ops.not_equal}

> max.graph.ops.not\_equal(lhs, rhs)

Computes the elementwise inequality comparison between two symbolic tensors.

Creates a new op node to compute the inequality comparison of two symbol tensor values and adds it to the graph, returning the symbolic result.

```python
def not_equal_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2,), device=DeviceRef.CPU())
    with Graph("not_equal_graph", input_types=(input_type, input_type)) as graph:
        x = graph.inputs[0]  # Left hand side
        y = graph.inputs[1]  # Right hand side
        out = ops.not_equal(x, y)
        graph.output(out)
```

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.
**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the inequality comparison.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the inequality comparison.

**Returns:** A symbolic tensor value representing the output of the inequality comparison. The result will have:

* the same dtype as the type-promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `outer()` {#max.graph.ops.outer}

> max.graph.ops.outer(lhs, rhs)

Computes the outer product of two symbolic vectors.

**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The left side of the product. Whatever its shape, it will be flattened to a rank-1 vector.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The right side of the product. Whatever its shape, it will be flattened to a rank-1 vector. Must have the same number of elements as lhs.
**Returns:** A symbolic tensor representing the [outer product](https://en.wikipedia.org/wiki/Outer_product) of the two input vectors. It will have rank 2, with the dimension sizes being the number of elements of lhs and rhs respectively.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `pow()` {#max.graph.ops.pow}

> max.graph.ops.pow(lhs, rhs)

Computes the elementwise exponentiation of two symbolic tensors.

Creates a new op node to compute the exponentiation of two symbol tensor values and adds it to the graph, returning the symbolic result.

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.

**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as left side of the exponentiation.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as right side of the exponentiation.

**Returns:** A symbolic tensor value representing the output of the exponentiation. The result will have:

* the same dtype as the type-promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values' shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `relu()` {#max.graph.ops.relu}

> max.graph.ops.relu(x)

Computes the elementwise relu of a symbolic tensor.

Creates a new op node to compute the elementwise relu of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the relu computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the relu value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `round()` {#max.graph.ops.round}

> max.graph.ops.round(x)

Computes the elementwise round of a symbolic tensor.

Creates a new op node to compute the elementwise round of a symbolic tensor and adds it to the graph, returning the symbolic result. Ties are rounded to the nearest even number (round half to even).

For example, if the model has one input tensor:

```python
def round_graph():
    input_type = TensorType(dtype=DType.float32, shape=(4,), device=DeviceRef.CPU())
    with Graph("round_graph_example", input_types=(input_type,)) as graph:
        x = graph.inputs[0]
        out = ops.round(x)
        graph.output(out)
```

**Parameters:**

* **value** – The symbolic tensor to use as the input to the round computation. If it's not a floating-point DType, an exception will be raised.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the round value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `rsqrt()` {#max.graph.ops.rsqrt}

> max.graph.ops.rsqrt(x)

Computes the elementwise inverse-square-root of a symbolic tensor.

Creates a new op node to compute the elementwise rsqrt of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the rsqrt computation. If it's not a floating-point DType, an exception will be raised.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the rsqrt value computation.
**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `sigmoid()` {#max.graph.ops.sigmoid}

> max.graph.ops.sigmoid(x)

Computes the elementwise sigmoid of a symbolic tensor.

Creates a new op node to compute the elementwise sigmoid of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the sigmoid computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the sigmoid value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `silu()` {#max.graph.ops.silu}

> max.graph.ops.silu(x)

Computes the elementwise silu of a symbolic tensor.

Creates a new op node to compute the elementwise silu of a symbolic tensor and adds it to the graph, returning the symbolic result.

`silu` is defined as `silu(x) = x * sigmoid(x)`.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the silu computation.
* **x** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) )

**Returns:** A new symbolic tensor value representing the output of the silu value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

### `sin()` {#max.graph.ops.sin}

> max.graph.ops.sin(x)

Computes the elementwise sine of a symbolic tensor.

Creates a new op node to compute the elementwise sine of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the sin computation. If it's not a floating-point DType, an exception will be raised.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Returns:** A new symbolic tensor value representing the output of the sin value computation.

**Raises:** **Error** – If the symbol doesn't represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `softmax()` {#max.graph.ops.softmax}

> max.graph.ops.softmax(x)

Computes the elementwise softmax of a symbolic tensor.

Creates a new op node to compute the elementwise softmax of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **value** – The symbolic tensor to use as the input to the softmax computation.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbolic tensor to use as the input to the softmax computation.

**Returns:** A new symbolic tensor value representing the output of the softmax computation.

**Raises:** **Error** – If the symbol doesn’t represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `sqrt()` {#max.graph.ops.sqrt}

> max.graph.ops.sqrt(x)

Computes the elementwise sqrt of a symbolic tensor.

Creates a new op node to compute the elementwise sqrt of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbolic tensor to use as the input to the sqrt computation. If it’s not a floating-point DType, an exception will be raised.

**Returns:** A new symbolic tensor value representing the output of the sqrt computation.

**Raises:** **Error** – If the symbol doesn’t represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `sub()` {#max.graph.ops.sub}

> max.graph.ops.sub(lhs, rhs)

Computes the elementwise subtraction of two symbolic tensors.

Creates a new op node to compute the subtraction of two symbolic tensor values and adds it to the graph, returning the symbolic result.

```python
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops

def sub_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2,), device=DeviceRef.CPU())
    with Graph("sub_graph", input_types=(input_type, input_type)) as graph:
        x = graph.inputs[0]  # Minuend (number being subtracted from)
        y = graph.inputs[1]  # Subtrahend (number being subtracted)
        out = ops.sub(x, y)
        graph.output(out)
```

* If `lhs` and `rhs` have different dtypes, they will be promoted according to the dtype promotion rules before the operation.
* If `lhs` and `rhs` have different shapes, they will be broadcast to the same shape according to broadcasting rules before the operation.
**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as the left side of the subtraction.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbol to use as the right side of the subtraction.

**Returns:** A symbolic tensor value representing the output of the subtraction. The result will have:

* the same dtype as the type promotion of the two input dtypes
* the same shape as the broadcast of the two input shapes.

**Raises:**

* **Error** – If the input values’ shapes are not compatible for broadcasting.
* **Error** – If one of the input values has an unsupported dtype.
* **Error** – If the two symbols are parts of different graphs.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `tanh()` {#max.graph.ops.tanh}

> max.graph.ops.tanh(x)

Computes the elementwise tanh of a symbolic tensor.

Creates a new op node to compute the elementwise tanh of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbolic tensor to use as the input to the tanh computation. If it’s not a floating-point DType, an exception will be raised.

**Returns:** A new symbolic tensor value representing the output of the tanh computation.

**Raises:** **Error** – If the symbol doesn’t represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `trunc()` {#max.graph.ops.trunc}

> max.graph.ops.trunc(x)

Computes the elementwise truncation of a symbolic tensor.

Creates a new op node to compute the elementwise truncation of a symbolic tensor and adds it to the graph, returning the symbolic result.

**Parameters:**
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The symbolic tensor to use as the input to the truncation computation. If it’s not a floating-point DType, an exception will be raised.

**Returns:** A new symbolic tensor value representing the output of the truncation computation.

**Raises:** **Error** – If the symbol doesn’t represent a tensor value.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Fast Fourier transforms

### `irfft()` {#max.graph.ops.irfft}

> max.graph.ops.irfft(input\_tensor, n=None, axis=-1, normalization=Normalization.BACKWARD, input\_is\_complex=False)

Computes the inverse real FFT of the input tensor.

**Parameters:**

* **input\_tensor** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The input tensor to compute the inverse real FFT of.
* **n** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – The size of the output tensor. Must be an int, and cannot be a symbolic Tensor. The input tensor will be padded or truncated to `n // 2 + 1` along the specified axis.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The axis to compute the inverse real FFT of.
* **normalization** (`Normalization` `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – The normalization to apply to the output tensor. Can be “backward”, “ortho”, or “forward”. When “backward”, the output is divided by n. When “ortho”, the output is divided by sqrt(n). When “forward”, no normalization is applied.
* **input\_is\_complex** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether the input tensor is already interleaved complex. The last dimension of the input tensor must be 2, and is excluded from the dimension referred to by axis.

**Returns:** The inverse real FFT of the input tensor. The shape of the output tensor is the same as the shape of the input tensor, except for the axis that the inverse real FFT is computed over, which is replaced by n.

## Linalg

### `band_part()` {#max.graph.ops.band_part}

> max.graph.ops.band\_part(x, num\_lower=None, num\_upper=None, exclude=False)

Masks out everything except a diagonal band of an input matrix.

Copies a tensor, setting everything outside the central diagonal band of the matrices to zero, where all but the last two axes are effectively batches, and the last two axes define sub-matrices.

Assuming the input has dimensions `[I, J, ..., M, N]`, the output tensor has the same shape as the input, and the values are given by

```python
out[i, j, ..., m, n] = in_band(m, n) * input[i, j, ..., m, n]
```

with the indicator function:

```python
in_band(m, n) = ((num_lower is None) || (m - n) <= num_lower) &&
                ((num_upper is None) || (n - m) <= num_upper)
```

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input to mask out.
* **num\_lower** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – The number of diagonal bands to include below the central diagonal. If None, include the entire lower triangle.
* **num\_upper** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – The number of diagonal bands to include above the central diagonal. If None, include the entire upper triangle.
* **exclude** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – If true, invert the selection of elements to mask. Elements in the band are set to zero.

**Returns:** A symbolic tensor value with the configured selection masked out to 0 values, and the remaining values copied from the input tensor.

**Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the input tensor rank is less than 2, or if num\_lower/num\_upper are out of bounds for statically known dimensions.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `layer_norm()` {#max.graph.ops.layer_norm}

> max.graph.ops.layer\_norm(input, gamma, beta, epsilon)

Performs layer normalization.

**Parameters:**

* **input** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The input tensor to normalize.
* **gamma** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The gamma parameter of the normalization.
* **beta** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The beta parameter of the normalization.
* **epsilon** ([`float`](https://docs.python.org/3/library/functions.html#float) ) – The epsilon parameter of the normalization.

**Returns:** A graph tensor value with the normalization applied.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `matmul()` {#max.graph.ops.matmul}

> max.graph.ops.matmul(lhs, rhs)

Computes the matrix multiplication of two tensor graph values.

Performs general matrix multiplication with broadcasting.

If the lhs is 1D, it will be reshaped to `1xD`. If the rhs is 1D, it will be reshaped to `Dx1`. In both cases, the additional 1 dimensions will be removed from the output shape.

For the multiplication, the innermost (rightmost) 2 dimensions are treated as a matrix.
The lhs matrix will have the shape `MxK`. The rhs matrix will have the shape `KxN`. The output will have the shape `MxN`. The `K` dimensions must be equivalent in both matrices.

The remaining outer dimensions will be broadcast.

**Parameters:**

* **lhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The left-hand side of the matmul.
* **rhs** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The right-hand side of the matmul.

**Returns:** A tensor graph value representing the result of broadcasting the two matrices together and then performing a matrix multiply along the innermost two dimensions of each tensor.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Buffer operations

### `buffer_load()` {#max.graph.ops.buffer_load}

> max.graph.ops.buffer\_load(x)

Loads the input buffer into a tensor.

It loads the in-place mutable tensor to an immutable tensor graph value. This is semantically equivalent to a copy from the mutable tensor `x` to the value-semantic output tensor.

**Parameters:** **x** ([`BufferValue`](BufferValue.md#max.graph.BufferValue) ) – The buffer to be loaded to a tensor.

**Returns:** A tensor graph value representing a copy of the loaded buffer.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `buffer_store()` {#max.graph.ops.buffer_store}

> max.graph.ops.buffer\_store(destination, source)

Stores the input tensor into the inout buffer.

It stores the immutable input tensor `source` in the mutable buffer `destination`. This is semantically equivalent to a copy from the `source` tensor to the `destination` buffer.

**Parameters:**

* **destination** ([`BufferValue`](BufferValue.md#max.graph.BufferValue) ) – The buffer to store the tensor in.
* **source** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The tensor to be stored in the buffer.

**Return type:** None

### `buffer_store_slice()` {#max.graph.ops.buffer_store_slice}

> max.graph.ops.buffer\_store\_slice(destination, source, indices)

Stores the input tensor into a slice of the input buffer.

It stores the immutable input tensor `source` in the mutable buffer `destination`. This is semantically equivalent to a copy from the `source` tensor to a slice of the `destination` buffer at the index specified by `indices`.

**Parameters:**

* **destination** ([`BufferValue`](BufferValue.md#max.graph.BufferValue) ) – The buffer to store the tensor in.
* **source** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The tensor to be stored in the buffer.
* **indices** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`slice`](https://docs.python.org/3/library/functions.html#slice) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`slice`](https://docs.python.org/3/library/functions.html#slice) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` `|` `EllipsisType` `]` ) – The index in the buffer where the tensor should be stored.

**Return type:** None

## Call operations

### `call()` {#max.graph.ops.call}

> max.graph.ops.call(graph, \*args)

Calls a graph with the provided arguments and returns its results.

This function invokes a previously defined graph, passing in the provided arguments and the current chain value, and returns the results. The body of the graph is ultimately inlined into the caller, so the chain value is only used for serialization if the subgraph’s body contains an operation that makes use of it in the first place.

The current advantage of using subgraphs is that they offer a way to improve compile times for operations that are used repeatedly in a model. As a secondary benefit, subgraphs also make the IR more readable by allowing control flow to be expressed in a more natural way.

**Parameters:**

* **graph** ([`Graph`](Graph.md#max.graph.Graph) ) – The graph to call.
* **\*args** ([`Value`](Value.md#max.graph.Value) ) – Arguments to pass to the called graph.

**Returns:** Either a single Value or a list of Values representing the graph outputs (excluding the chain value, which is handled internally).

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*Value*](Value.md#max.graph.Value)]

## Flatten

### `flatten()` {#max.graph.ops.flatten}

> max.graph.ops.flatten(x, start\_dim=0, end\_dim=-1)

Flattens the specified dims of a symbolic tensor.

The number and order of the elements in the tensor are unchanged. All dimensions from start\_dim to end\_dim (inclusive) are merged into a single output dim.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **start\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **end\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Fold

### `fold()` {#max.graph.ops.fold}

> max.graph.ops.fold(input, output\_size, kernel\_size, stride=1, dilation=1, padding=0)

Combines an array of sliding blocks into a larger containing tensor.
The input tensor must have shape `(N, C * kernel_sizes, L)` where `N` is the batch dimension, `C` is the number of channels, `kernel_sizes` is the product of the kernel sizes, and `L` is the number of local blocks. The resulting output tensor will have shape `(N, C, output_shape[0], output_shape[1])`.

`L`, the number of blocks, must be equivalent to: `prod((output_size[d] + 2 * padding[d] - dilation[d] * (kernel_size[d] - 1) - 1) / stride[d] + 1)` where `d` is over all spatial dimensions.

**Parameters:**

* **input** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The 3D tensor to fold with shape `(N, C * kernel sizes, L)`.
* **output\_size** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – Spatial dimensions of the output tensor. Must be a tuple of two ints.
* **kernel\_size** ([`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – The size of the sliding blocks. Must be a tuple of two ints.
* **stride** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The stride of the sliding blocks in the input dimension (can be an int or a tuple of two ints).
* **dilation** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The spacing between the kernel elements (can be an int or a tuple of two ints).
* **padding** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `,` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – 0-paddings to be added on both sides of the inputs (can be an int or a tuple of two ints).

**Returns:** The folded 4D tensor with shape `(N, C, output_shape[0], output_shape[1])`.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Pad

### `pad()` {#max.graph.ops.pad}

> max.graph.ops.pad(input, paddings, mode='constant', value=0)

Pads a symbolic tensor with a constant value according to the specified paddings.

**Parameters:**

* **input** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **paddings** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` )
* **mode** ([`Literal`](https://docs.python.org/3/library/typing.html#typing.Literal) `[` `'constant'` `]` )
* **value** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Permute

### `permute()` {#max.graph.ops.permute}

> max.graph.ops.permute(x, dims)

Permutes all dimensions of a symbolic tensor.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to transpose.
* **dims** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – The desired ordering of the dimensions in the output tensor.

**Returns:** A new symbolic tensor with the dimensions permuted to match the passed-in order. It has the same elements and dtype, but the order of the elements is different according to the permutation.
**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Quantized

### `dequantize()` {#max.graph.ops.dequantize}

> max.graph.ops.dequantize(encoding, quantized)

Dequantizes a quantized tensor to floating point.

NOTE: Currently this supports Q4\_0, Q4\_K, and Q6\_K encodings only.

**Parameters:**

* **encoding** ([`QuantizationEncoding`](quantization.md#max.graph.quantization.QuantizationEncoding) ) – The quantization encoding to use.
* **quantized** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The quantized tensor to dequantize.

**Returns:** The dequantized result (a floating point tensor).

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `qmatmul()` {#max.graph.ops.qmatmul}

> max.graph.ops.qmatmul(encoding, config, lhs, \*rhs)

Performs matrix multiplication between floating point and quantized tensors.

This quantizes the `lhs` floating point value to match the encoding of the `rhs` quantized value, performs matmul, and then dequantizes the result. Beware that, compared to a regular matmul op, this one expects the `rhs` value to be transposed. For example, if the `lhs` shape is `[32, 64]`, and the quantized `rhs` shape is also `[32, 64]`, then the output shape is `[32, 32]`.

That is, this function returns the result from:

> dequantize(quantize(lhs) @ transpose(rhs))

The last two dimensions in `lhs` are treated as matrices and multiplied by `rhs` (which must be a 2D tensor). Any remaining dimensions in `lhs` are broadcast dimensions.

NOTE: Currently this supports Q4\_0, Q4\_K, and Q6\_K encodings only.

**Parameters:**

* **encoding** ([`QuantizationEncoding`](quantization.md#max.graph.quantization.QuantizationEncoding) ) – The quantization encoding to use.
* **lhs** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The non-quantized, left-hand side of the matmul.
* **\*rhs** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The transposed and quantized right-hand side of the matmul and the auxiliary tensor (if present). Must be rank 2 and in a supported [quantization encoding](/max/api/mojo/graph/quantization/).
* **config** ([`QuantizationConfig`](quantization.md#max.graph.quantization.QuantizationConfig) `|` `None` )

**Returns:** The dequantized result (a floating point tensor).

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Range

### `range()` {#max.graph.ops.range}

> max.graph.ops.range(start, stop, step, out\_dim=None, device=cpu:0, dtype=float32)

Creates a sequence of numbers. The sequence goes from start with increments of size step up to (but not including) stop. All arguments are mandatory and must have the same element type.

Note the following restrictions on input values:

1. `step` must be non-zero
2. `stop - start` must be zero or have the same sign as `step`

**Parameters:**

* **start** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The start of the range to generate.
* **stop** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The range will be generated up to, but not including, this value.
* **step** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The step size for the range.
* **out\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` `None` ) – The expected output dimensions returned by the range op. These will be asserted to be correct at graph execution time.
* **device** (`DeviceRef` ) – Device of the result tensor.
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) )

**Returns:** A symbolic tensor value containing the defined range of values.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Repeat

### `repeat_interleave()` {#max.graph.ops.repeat_interleave}

> max.graph.ops.repeat\_interleave(x, repeats, axis=None, out\_dim=None)

Repeats elements of a tensor along the given dimension.
Modeled after `torch.repeat_interleave`.

For example, given `repeats=2` and the following input:

```python
# Input tensor with shape (2, 2)
input = TensorValue(x)  # Contains [[1.0, 2.0], [3.0, 4.0]]
```

`repeat_interleave` with `axis=0`:

```python
# Output tensor with shape (4, 2)
output = repeat_interleave(input, repeats=2, axis=0)
# Contains [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0]]
```

`repeat_interleave` with `axis=1`:

```python
# Output tensor with shape (2, 4)
output = repeat_interleave(input, repeats=2, axis=1)
# Contains [[1.0, 1.0, 2.0, 2.0], [3.0, 3.0, 4.0, 4.0]]
```

`repeat_interleave` with `axis=None` (the default):

```python
# Output tensor with shape (8,)
output = repeat_interleave(input, repeats=2)  # axis = None
# Contains [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
```

`repeat_interleave` with `repeats=[2, 3]` and `axis=0`:

```python
repeat_value = TensorValue([2, 3])
# Output tensor with shape (5, 2)
output = repeat_interleave(input, repeats=repeat_value, axis=0)
# Contains [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
```

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input tensor.
* **repeats** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The number of repetitions for each element.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – The dimension along which to repeat values. If axis is not specified or None (the default), flatten the input array and repeat the flattened values.
* **out\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` `None` )

**Returns:** A symbolic tensor with the elements interleaved.

**Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If `repeats` is non-positive or if `axis` is out of range.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Tile

### `tile()` {#max.graph.ops.tile}

> max.graph.ops.tile(x, repeats)

Returns a new tensor as the result of copying the input tensor `N_i` times on each dimension, where `N_i = repeats[i]`. The i-th dimension of the output shape will be the i-th dimension of the input shape multiplied by `N_i`.
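For instance, the following minimal sketch (an illustrative `tile_graph` helper, assuming the same imports as the `round()` example above) tiles a `(2, 3)` input twice along the first dimension and once along the second:

```python
def tile_graph():
    input_type = TensorType(dtype=DType.float32, shape=(2, 3), device=DeviceRef.CPU())
    with Graph("tile_graph", input_types=(input_type,)) as graph:
        x = graph.inputs[0]
        # repeats=[2, 1]: copy twice along dimension 0 and once along
        # dimension 1, so the output shape is (4, 3).
        out = ops.tile(x, repeats=[2, 1])
        graph.output(out)
```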
**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **repeats** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` )

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Transfer

### `transfer_to()` {#max.graph.ops.transfer_to}

> max.graph.ops.transfer\_to(x, device)

Device-to-device transfer operation.

This op transfers the input tensor from its current device to another. A device represents a computation unit, like a CPU or GPU. This op is useful when working with accelerators, for example to move data from one GPU to another, or from a GPU to the CPU.

**Parameters:**

* **x** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – The input tensor to transfer.
* **device** (`DeviceRef` ) – The device to transfer to.

**Returns:** A tensor transferred to the specified device.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## TopK

### `top_k()` {#max.graph.ops.top_k}

> max.graph.ops.top\_k(input, k, axis=-1)

Returns a tensor with only the top K values along the given axis.

**Parameters:**

* **input** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input tensor from which to select the top k.
* **k** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of values to select from the input.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The axis from which to select the top k.

**Returns:** The top K values and the top K indices.

**Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[*TensorValue*](TensorValue.md#max.graph.TensorValue), [*TensorValue*](TensorValue.md#max.graph.TensorValue)]

## Reduction

### `argmax()` {#max.graph.ops.argmax}

> max.graph.ops.argmax(x, axis=-1)

Reduces a symbolic tensor using an argmax operation.

When provided with a tensor in which all elements are identical, on CPU this will return the first element index in the tensor; on GPU it will return an arbitrary index.
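As a minimal sketch of how this fits the graph-building pattern used elsewhere on this page (the `argmax_graph` name is illustrative, and the imports from the `round()` example above are assumed):

```python
def argmax_graph():
    input_type = TensorType(dtype=DType.float32, shape=(3, 4), device=DeviceRef.CPU())
    with Graph("argmax_graph", input_types=(input_type,)) as graph:
        x = graph.inputs[0]
        # Reduce along the last axis; the result keeps rank 2,
        # with shape (3, 1) as described below.
        out = ops.argmax(x, axis=-1)
        graph.output(out)
```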
**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input tensor for the operation.
* **axis** – The axis along which to compute the reduction. If negative, indexes from the last dimension. For example, a value of -1 will compute the reduction along the last dimension.

**Returns:** A symbolic tensor representing the result of the argmax operation. The tensor will have the same rank as the input tensor, and the same shape except along the `axis` dimension which will have size 1.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `argmin()` {#max.graph.ops.argmin}

> max.graph.ops.argmin(x, axis=-1)

Reduces a symbolic tensor using an argmin operation.

When provided with a tensor in which all elements are identical, on CPU this will return the first element index in the tensor; on GPU it will return an arbitrary index.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input tensor for the operation.
* **axis** – The axis along which to compute the reduction. If negative, indexes from the last dimension. For example, a value of -1 will compute the reduction along the last dimension.

**Returns:** A symbolic tensor representing the result of the argmin operation. The tensor will have the same rank as the input tensor, and the same shape except along the `axis` dimension which will have size 1.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `mean()` {#max.graph.ops.mean}

> max.graph.ops.mean(x, axis=-1)

Reduces a symbolic tensor using a mean operation.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input tensor for the operation.
* **axis** – The axis along which to compute the reduction. If negative, indexes from the last dimension. For example, a value of -1 will compute the reduction along the last dimension.
**Returns:** A symbolic tensor representing the result of the mean operation. The tensor will have the same rank as the input tensor, and the same shape except along the `axis` dimension which will have size 1.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `sum()` {#max.graph.ops.sum}

> max.graph.ops.sum(x, axis=-1)

Reduces a symbolic tensor using a sum operation.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input tensor for the operation.
* **axis** – The axis along which to compute the reduction. If negative, indexes from the last dimension. For example, a value of -1 will compute the reduction along the last dimension.

**Returns:** A symbolic tensor representing the result of the sum operation. The tensor will have the same rank as the input tensor, and the same shape except along the `axis` dimension which will have size 1.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Indexing

### `argsort()` {#max.graph.ops.argsort}

> max.graph.ops.argsort(x, ascending=True)

Returns the indices that would sort a tensor.

This function returns the indices that would sort the input tensor along its first dimension. The returned indices are of type int64.

**Parameters:**

* **x** ([`TensorValue`](TensorValue.md#max.graph.TensorValue) ) – Input tensor to be sorted.
* **ascending** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – If True (default), sort in ascending order. If False, sort in descending order.

**Returns:** A tensor of indices of the same shape as the input tensor.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `nonzero()` {#max.graph.ops.nonzero}

> max.graph.ops.nonzero(x, out\_dim)

Returns the indices of all nonzero elements in a tensor.

Returns a tensor of indices of the nonzero values in the given tensor. The return value is a 2D tensor of shape `[out_dim x rank_in]`, where out\_dim is the number of nonzero elements in the input tensor, and rank\_in is the rank of the input tensor. Indices are generated in row-major order.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor.
* **out\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) ) – The newly generated dimension that is sized for the number of nonzero elements.

**Returns:** A symbolic tensor of indices.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Cumulative operations

### `cumsum()` {#max.graph.ops.cumsum}

> max.graph.ops.cumsum(x, axis=-1, exclusive=False, reverse=False)

Computes the cumulative sum of the input tensor along the given axis.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input tensor to sum over.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The axis along which to compute the sum. If negative, indexes from the last dimension. For example, a value of -1 will compute the sum along the last dimension.
* **exclusive** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – If set, start at 0 and exclude the final element. Otherwise, start with the first element. Said another way, cumsum computes `[sum(x[..., :i, ...]) for i in range(x.shape[axis])]`. If exclusive is set, the bounds are instead `range(1, x.shape[axis])`.
* **reverse** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – If set, start from the end. In other words, the first element will be the total sum, with each following element counting downward; or `[sum(x[..., i:, ...]) for i in range(x.shape[axis])]`.

**Returns:** A symbolic tensor representing the result of the cumsum operation. The tensor will have the same type as the input tensor. The computed values will be the cumulative sum of the values along the given axis, according to the specified parameters:

* if exclusive is set, the first value will be 0, and the last value will be excluded from the sum
* if reverse is set, the sum will be computed starting at the back of the axis back to the front, rather than front-to-back

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Audio processing

### `hann_window()` {#max.graph.ops.hann_window}

> max.graph.ops.hann\_window(window\_length, device, periodic=True, dtype=float32)

Calculates a Hann window of a given length.

Hann window function:

$$
H[n] = \frac{1}{2}\left[1 - \cos\left(\frac{2\pi n}{N - 1}\right)\right]
$$

where `N` is `window_length`.

**Parameters:**

* **window\_length** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The length of the window.
* **device** (`DeviceRef` ) – The device to run the operation on.
* **periodic** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Boolean flag that determines whether the returned window trims off the last duplicate value from the symmetric window, making it ready to be used as a periodic window with functions like stft().
  That is, `hann_window(L, periodic=True) == hann_window(L + 1, periodic=False)[:-1]`.
* **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The desired data type of the output tensor.

**Returns:** A 1-D tensor of size `(window_length,)` containing the window.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Slicing

### `chunk()` {#max.graph.ops.chunk}

> max.graph.ops.chunk(x, chunks, axis=0)

Chunks the tensor into an exact number of chunks along the specified dim.

**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The tensor to chunk.
* **chunks** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of chunks to split the tensor into. chunks must statically evenly divide `x.shape[axis]`.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The axis to split the tensor along.

**Returns:** A list of chunked tensors.

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*TensorValue*](TensorValue.md#max.graph.TensorValue)]

### Example

```pycon
>>> a = TensorValue([1, 2, 3, 4])
>>> chunk(a, 2, 0)
[TensorValue([1, 2]), TensorValue([3, 4])]
```

### `concat()` {#max.graph.ops.concat}

> max.graph.ops.concat(original\_vals, axis=0)

Concatenates a list of symbolic tensors along an axis.

**Parameters:**

* **original\_vals** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` `Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `]` ) – A list of symbolic tensor values. Each tensor must have the same dtype and rank, and must have the same dimension size for each dimension other than `axis`.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The axis to concatenate along. If negative, indexes relative to the end of the tensor shape. For instance, `concat(vs, -1)` will concat along the last dimension.

**Returns:** A new symbolic tensor representing the concatenation result. It will have the same rank as each input tensor, and its dimensions will be the same as each input tensor’s for each dimension other than `axis`, which will have size equal to the sum of all tensors’ sizes for that dimension.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `gather()` {#max.graph.ops.gather}

> max.graph.ops.gather(input, indices, axis=-1)

Selects elements out of an input tensor by index.
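For instance, a minimal sketch (an illustrative `gather_graph` helper; the imports from the `round()` example above are assumed) that gathers two rows of a matrix along axis 0:

```python
def gather_graph():
    input_type = TensorType(dtype=DType.float32, shape=(4, 8), device=DeviceRef.CPU())
    indices_type = TensorType(dtype=DType.int64, shape=(2,), device=DeviceRef.CPU())
    with Graph("gather_graph", input_types=(input_type, indices_type)) as graph:
        x, indices = graph.inputs
        # Each index selects one row of x along axis 0,
        # so the output shape is (2, 8).
        out = ops.gather(x, indices, axis=0)
        graph.output(out)
```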
**Parameters:**

* **input** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to select elements from.
* **indices** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – A symbolic tensor of index values to use for selection.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimension which `indices` indexes from `input`. If negative, indexes relative to the end of the input tensor. For instance, `gather(input, indices, axis=-1)` will index against the last dimension of `input`.

**Returns:** A new symbolic tensor representing the result of the gather operation.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `masked_scatter()` {#max.graph.ops.masked_scatter}

> max.graph.ops.masked\_scatter(input, mask, updates, out\_dim)

Creates a new symbolic tensor where the updates are written to input where mask is true.

**Parameters:**

* **input** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to write elements to.
* **mask** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – A symbolic tensor of boolean values indicating where updates should be written.
* **updates** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – A symbolic tensor of elements to write to input. * **out\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) ) – The new data-dependent dimension. **Returns:** A new symbolic tensor representing the result of the masked\_scatter operation. **Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue) ### `scatter()` {#max.graph.ops.scatter} > max.graph.ops.scatter(input, updates, indices, axis=-1) Creates a new symbolic tensor where the updates are written to input according to indices. **Parameters:** * **input** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to write elements to. * **updates** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – A symbolic tensor of elements to write to input. * **indices** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The positions in input to update. 
* **axis** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The axis along which `indices` indexes into `input`.

**Returns:** A new symbolic tensor representing the result of the scatter operation.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `select()` {#max.graph.ops.select}

> max.graph.ops.select(cond, x, y)

Returns `cond ? x : y` (element-wise), where `cond`, `x` and `y` are input tensors.

**Parameters:**

* **cond** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The condition tensor to use for selecting elementwise values.
* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – If the condition is true at a position, the value from the same position in this tensor will be selected.
* **y** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – If the condition is false at a position, the value from the same position in this tensor will be selected.

**Returns:** A new symbolic tensor holding values from either `x` or `y`, based on the elements in `cond`.

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

### `split()` {#max.graph.ops.split}

> max.graph.ops.split(x, split\_sizes, axis=0)

Splits the input tensor into multiple tensors along a given dimension.
**Parameters:**

* **x** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) – The input symbolic tensor to split.
* **split\_sizes** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – Sizes of each output tensor. Must sum to the size of the input tensor along the split dimension `axis`.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimension along which to split the input tensor.

**Returns:** A list of tensors with the same length as split\_sizes, where each tensor has the same shape as the input except along the split dimension axis, where the size is given by the corresponding element in split\_sizes.

**Return type:** [list](https://docs.python.org/3/library/stdtypes.html#list)\[[*TensorValue*](TensorValue.md#max.graph.TensorValue)]

### `stack()` {#max.graph.ops.stack}

> max.graph.ops.stack(values, axis=0)

Stacks a list of tensors along a new axis.

**Parameters:**

* **values** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` `Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `]` ) – A list of symbolic tensor values. Each tensor must have the same dtype and rank, and must have the same dimension size for each dimension.
* **axis** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The axis along which to stack. If negative, indexes relative to the end of the tensor shape *plus 1*. For instance, `stack(vs, -1)` will create and stack along a new axis as the last dimension, and `stack(vs, -2)` will create and stack along a new dimension which is inserted immediately before the last dimension.

**Returns:** A new symbolic tensor representing the result of the stack. It will have rank `n+1` where `n` is the rank of each input tensor. Its size on each dimension other than `axis` will be the same as each input tensors', with the new axis inserted. Along the new dimension it will have size `len(values)`.
**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

## Random operations

### `normal()` {#max.graph.ops.random.normal}

> max.graph.ops.random.normal(like, mean=0.0, std=1.0)

Creates a new symbolic tensor with values drawn from a normal (Gaussian) distribution with the given `mean` and standard deviation `std`, matching the shape and dtype of `like`.

**Parameters:**

* **like** ([`TensorType`](type.md#max.graph.type.TensorType) )
* **mean** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )
* **std** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](TensorValue.md#max.graph.TensorValue) `|` [`Shape`](type.md#max.graph.type.Shape) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Return type:** [*TensorValue*](TensorValue.md#max.graph.TensorValue)

---

## optional

Defines Optional, a type modeling a value which may or may not be present.

Optional values can be thought of as a type-safe nullable pattern. Your value can take on a value or `None`, and you need to check and explicitly extract the value to get it out.

Examples:

```mojo
var a = Optional(1)
var b = Optional[Int](None)

if a:
    print(a.value())  # prints 1
if b:  # Bool(b) is False, so no print
    print(b.value())

var c = a.or_else(2)
var d = b.or_else(2)

print(c)  # prints 1
print(d)  # prints 2
```

## Structs

* [​`Optional`](/mojo/stdlib/collections/optional/Optional): A type modeling a value which may or may not be present.
* [​`OptionalReg`](/mojo/stdlib/collections/optional/OptionalReg): A register-passable optional type.

---

## Optional

`struct Optional[T: Copyable & Movable]`

A type modeling a value which may or may not be present.

Optional values can be thought of as a type-safe nullable pattern. Your value can take on a value or `None`, and you need to check and explicitly extract the value to get it out.

Currently `T` is required to be `Copyable & Movable` so we can implement copy/move for `Optional` and allow it to be used in collections itself.

Examples:

```mojo
var a = Optional(1)
var b = Optional[Int](None)

if a:
    print(a.value())  # prints 1
if b:  # Bool(b) is False, so no print
    print(b.value())

var c = a.or_else(2)
var d = b.or_else(2)

print(c)  # prints 1
print(d)  # prints 2
```

## Parameters

* ​T (`Copyable & Movable`): The type of value stored in the `Optional`.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self)`

Construct an empty `Optional`.

`@implicit` `__init__(out self, owned value: T)`

Construct an `Optional` containing a value.

**Args:**

* ​value (`T`): The value to store in the `Optional`.

`@implicit` `__init__(out self, value: NoneType)`

Construct an empty `Optional`.

**Args:**

* ​value (`NoneType`): Must be exactly `None`.
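Because the single-argument constructors above are marked `@implicit`, plain values and `None` literals convert directly to `Optional` at binding sites. A minimal sketch of what that enables:

```mojo
fn main():
    var x: Optional[Int] = 5     # uses the @implicit value constructor
    var y: Optional[Int] = None  # uses the @implicit NoneType constructor
    print(x.value())             # prints 5
    print(y is None)             # prints True
```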
### `__bool__`

`__bool__(self) -> Bool`

Return true if the Optional has a value.

**Returns:** True if the `Optional` has a value and False otherwise.

### `__getitem__`

`__getitem__(ref self) -> ref [$1._value] T`

Retrieve a reference to the value inside the `Optional`.

**Returns:** A reference to the value inside the `Optional`.

**Raises:** On empty `Optional`.

### `__invert__`

`__invert__(self) -> Bool`

Return False if the `Optional` has a value.

**Returns:** False if the `Optional` has a value and True otherwise.

### `__eq__`

`__eq__(self, rhs: NoneType) -> Bool`

Return `True` if a value is not present.

**Args:**

* ​rhs (`NoneType`): The `None` value to compare to.

**Returns:** `True` if a value is not present, `False` otherwise.

`__eq__[T: EqualityComparable & Copyable & Movable](self: Optional[T], rhs: Optional[T]) -> Bool`

Return `True` if this is the same as another `Optional` value, meaning both are absent, or both are present and have the same underlying value.

**Parameters:**

* ​T (`EqualityComparable & Copyable & Movable`): The type of the value in the `Optional`. Must implement the traits `Copyable`, `Movable` and `EqualityComparable`.

**Args:**

* ​rhs (`Optional[T]`): The value to compare to.

**Returns:** True if the values are the same.

### `__ne__`

`__ne__(self, rhs: NoneType) -> Bool`

Return `True` if a value is present.

**Args:**

* ​rhs (`NoneType`): The `None` value to compare to.

**Returns:** `False` if a value is not present, `True` otherwise.

`__ne__[T: EqualityComparable & Copyable & Movable, //](self: Optional[T], rhs: Optional[T]) -> Bool`

Return `False` if this is the same as another `Optional` value, meaning both are absent, or both are present and have the same underlying value.

**Parameters:**

* ​T (`EqualityComparable & Copyable & Movable`): The type of the value in the `Optional`. Must implement the traits `Copyable`, `Movable` and `EqualityComparable`.

**Args:**

* ​rhs (`Optional[T]`): The value to compare to.

**Returns:** False if the values are the same.

### `__is__`

`__is__(self, other: NoneType) -> Bool`

Return `True` if the Optional has no value.

Notes: It allows you to use the following syntax: `if my_optional is None:`.

**Args:**

* ​other (`NoneType`): The value to compare to (None).

**Returns:** True if the Optional has no value and False otherwise.

### `__isnot__`

`__isnot__(self, other: NoneType) -> Bool`

Return `True` if the Optional has a value.

Notes: It allows you to use the following syntax: `if my_optional is not None:`.

**Args:**

* ​other (`NoneType`): The value to compare to (None).

**Returns:** True if the Optional has a value and False otherwise.

### `copy`

`copy(self) -> Self`

Copy construct an `Optional`.

**Returns:** A copy of the value.

### `__str__`

`__str__[U: Copyable & Movable & Representable, //](self: Optional[U]) -> String`

Return the string representation of the value of the `Optional`.

**Parameters:**

* ​U (`Copyable & Movable & Representable`): The type of the value in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`.

**Returns:** A string representation of the `Optional`.

### `__repr__`

`__repr__[U: Representable & Copyable & Movable, //](self: Optional[U]) -> String`

Returns the verbose string representation of the `Optional`.

**Parameters:**

* ​U (`Representable & Copyable & Movable`): The type of the value in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`.

**Returns:** A verbose string representation of the `Optional`.
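A short sketch pulling together the comparison operators documented above (assuming an `Int` payload, which satisfies `EqualityComparable & Copyable & Movable`):

```mojo
fn main():
    var a = Optional(1)
    var b = Optional[Int](None)
    print(a == Optional(1))  # True: both present, equal underlying values
    print(a != b)            # True: one is present, the other absent
    print(b == None)         # True: an absent Optional compares equal to None
    print(a is None)         # False: `a` holds a value
    print(~b)                # True: __invert__ is the negation of __bool__
```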
### `write_to`

`write_to[W: Writer, U: Representable & Copyable & Movable, //](self: Optional[U], mut writer: W)`

Write `Optional` string representation to a `Writer`.

**Parameters:**

* ​W (`Writer`): A type conforming to the Writable trait.
* ​U (`Representable & Copyable & Movable`): The type of the value in the `Optional`. Must implement the traits `Representable`, `Copyable` and `Movable`.

**Args:**

* ​writer (`W`): The object to write to.

### `value`

`value(ref self) -> ref [$1._value] T`

Retrieve a reference to the value of the `Optional`.

Notes: This will abort on empty `Optional`.

**Returns:** A reference to the contained data of the `Optional`.

### `unsafe_value`

`unsafe_value(ref self) -> ref [$1._value] T`

Unsafely retrieve a reference to the value of the `Optional`.

Notes: This will **not** abort on empty `Optional`.

**Returns:** A reference to the contained data of the `Optional`.

### `take`

`take(mut self) -> T`

Move the value out of the `Optional`.

Notes: This will abort on empty `Optional`.

**Returns:** The contained data of the `Optional` as an owned T value.

### `unsafe_take`

`unsafe_take(mut self) -> T`

Unsafely move the value out of the `Optional`.

Notes: This will **not** abort on empty `Optional`.

**Returns:** The contained data of the `Optional` as an owned T value.

### `or_else`

`or_else(self, default: T) -> T`

Return the underlying value contained in the `Optional` or a default value if the `Optional`'s underlying value is not present.

**Args:**

* ​default (`T`): The new value to use if no value was present.

**Returns:** The underlying value contained in the `Optional` or a default value.

### `copied`

`copied[mut: Bool, origin: Origin[mut], //, T: Copyable & Movable](self: Optional[Pointer[T, origin]]) -> Optional[T]`

Converts an `Optional` containing a Pointer to an `Optional` of an owned value by copying.

Examples:

Copy the value of an `Optional[Pointer[_]]`

```mojo
var data = String("foo")
var opt = Optional(Pointer(to=data))
var opt_owned: Optional[String] = opt.copied()
```

Notes: If `self` is an empty `Optional`, the returned `Optional` will be empty as well.

**Parameters:**

* ​mut (`Bool`): Mutability of the pointee origin.
* ​origin (`Origin[mut]`): Origin of the contained `Pointer`.
* ​T (`Copyable & Movable`): Type of the owned result value.

**Returns:** An `Optional` containing an owned copy of the pointee value.

---

## OptionallyStaticInt

## Implemented traits

`AnyType`, `Copyable`, `Intable`, `Movable`, `UnknownDestructibility`

## Aliases

### `static_value`

`alias static_value`

## Methods

### `__copyinit__`

`__copyinit__(out self: _Self, existing: _Self, /)`

Create a new instance of the value by copying an existing one.

**Args:**

* ​existing (`_Self`): The value to copy.

### `__moveinit__`

`__moveinit__(out self: _Self, owned existing: _Self, /)`

Create a new instance of the value by moving the value of another.

**Args:**

* ​existing (`_Self`): The value to move.

### `as_uint32`

`as_uint32(self: _Self) -> SIMD[uint32, 1]`

### `__int__`

`__int__(self: _Self) -> Int`

Get the integral representation of the value.

**Returns:** The integral representation of the value.

---

## OptionalReg

`@register_passable(trivial)`

`struct OptionalReg[T: AnyTrivialRegType]`

A register-passable optional type.

This struct optionally contains a value. It only works with trivial register passable types at the moment.

## Parameters

* ​T (`AnyTrivialRegType`): The type of value stored in the Optional.
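A minimal usage sketch (the `first_positive` helper is hypothetical, shown only to exercise the `@implicit` constructors, `__bool__`, `value()`, and `or_else()` documented below):

```mojo
fn first_positive(a: Int, b: Int) -> OptionalReg[Int]:
    if a > 0:
        return a  # @implicit constructor wraps the value
    if b > 0:
        return b
    return None   # @implicit constructor for the empty case

fn main():
    var r = first_positive(-1, 7)
    if r:                 # __bool__ is True when a value is present
        print(r.value())  # prints 7
    print(r.or_else(0))   # prints 7
```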
## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__() -> Self`

Create an optional with a value of None.

`@implicit` `__init__(value: T) -> Self`

Create an optional with a value.

**Args:**

* ​value (`T`): The value.

`@implicit` `__init__(value: NoneType) -> Self`

Create an optional without a value from a None literal.

**Args:**

* ​value (`NoneType`): The None value.

### `__bool__`

`__bool__(self) -> Bool`

Return true if the optional has a value.

**Returns:** True if the optional has a value and False otherwise.

### `__is__`

`__is__(self, other: NoneType) -> Bool`

Return `True` if the Optional has no value.

It allows you to use the following syntax: `if my_optional is None:`

**Args:**

* ​other (`NoneType`): The value to compare to (None).

**Returns:** True if the Optional has no value and False otherwise.

### `__isnot__`

`__isnot__(self, other: NoneType) -> Bool`

Return `True` if the Optional has a value.

It allows you to use the following syntax: `if my_optional is not None:`

**Args:**

* ​other (`NoneType`): The value to compare to (None).

**Returns:** True if the Optional has a value and False otherwise.

### `value`

`value(self) -> T`

Get the optional value.

**Returns:** The contained value.

### `or_else`

`or_else(owned self, owned default: T) -> T`

Return the underlying value contained in the Optional or a default value if the Optional's underlying value is not present.

**Args:**

* ​default (`T`): The new value to use if no value was present.

**Returns:** The underlying value contained in the Optional or a default value.

---

## ord

`ord(s: StringSlice[origin]) -> Int`

Returns an integer that represents the codepoint of a single-character string.

Given a string containing a single character `Codepoint`, return an integer representing the codepoint of that character. For example, `ord("a")` returns the integer `97`. This is the inverse of the `chr()` function.

This function is in the prelude, so you don't need to import it.

**Args:**

* ​s (`StringSlice[origin]`): The input string, which must contain only a single character.

**Returns:** An integer representing the code point of the given character.

---

## Origin

`@register_passable(trivial)`

`struct Origin[mut: Bool]`

This represents an origin reference for a memory value.

## Parameters

* ​mut (`Bool`): Whether the origin is mutable.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `cast_from`

`alias cast_from = _lit_mut_cast[mut, ?]`

Cast an existing Origin to be of the specified mutability.

This is a low-level way to coerce Origin mutability. This should be used rarely, typically when building low-level fundamental abstractions. Strongly consider alternatives before reaching for this "escape hatch".

Safety: This is an UNSAFE operation if used to cast an immutable origin to a mutable origin.

Examples:

Cast a mutable origin to be immutable:

```mojo
struct Container[mut: Bool, //, origin: Origin[mut]]:
    var data: Int

    fn imm_borrow(self) -> Container[ImmutableOrigin.cast_from[origin].result]:
        # ...
```

### `empty`

`alias empty = {}`

An empty `__origin_of()` of the given mutability. The empty origin is guaranteed not to alias any existing origins.

---

## OrMask

`@register_passable(trivial)`

`struct OrMask[T: MHAMask, S: MHAMask, //, lhs: T, rhs: S]`

Mask that's the OR of two masks.
## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAMask`, `Movable`, `UnknownDestructibility`

## Aliases

### `apply_log2e_after_mask`

`alias apply_log2e_after_mask = get_vtable_entry(:trait T, "apply_log2e_after_mask") if get_vtable_entry(:trait T, "apply_log2e_after_mask") else get_vtable_entry(:trait S, "apply_log2e_after_mask")`

### `mask_out_of_bound`

`alias mask_out_of_bound = get_vtable_entry(:trait S, "mask_out_of_bound") if get_vtable_entry(:trait T, "mask_out_of_bound") else get_vtable_entry(:trait T, "mask_out_of_bound")`

### `mask_safe_out_of_bounds`

`alias mask_safe_out_of_bounds = get_vtable_entry(:trait S, "mask_safe_out_of_bounds") if get_vtable_entry(:trait T, "mask_safe_out_of_bounds") else get_vtable_entry(:trait T, "mask_safe_out_of_bounds")`

## Methods

### `mask`

`mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]`

### `status`

`status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus`

---

## os

Provides access to operating-system dependent functionality.

The types and functions in this package primarily provide operating-system independent access to operating-system dependent features, such as file systems and environment variables.

For accessing files, see built-in [`open()`](/mojo/stdlib/builtin/file/open) function and the [`file`](/mojo/stdlib/builtin/file/) module. For manipulating file system paths, see the [`os.path`](/mojo/stdlib/os/path/) package for OS-independent path manipulation functions and the `pathlib` package for the [`Path`](/mojo/stdlib/pathlib/path/Path) struct, an abstraction for handling paths.

## Packages

* [​`path`](/mojo/stdlib/os/path/): Provides a set of operating-system independent functions for manipulating file system paths.

## Modules

* [​`atomic`](/mojo/stdlib/os/atomic/): Implements the `Atomic` struct.
* [​`env`](/mojo/stdlib/os/env/): Provides functions for working with environment variables.
* [​`fstat`](/mojo/stdlib/os/fstat/): Implements file system status operations.
* [​`os`](/mojo/stdlib/os/os/): Provides functions to access operating-system dependent functionality, including file system operations.
* [​`pathlike`](/mojo/stdlib/os/pathlike/): Implements the `PathLike` trait.

---

## os

Provides functions to access operating-system dependent functionality, including file system operations.

You can import a method from the `os` package. For example:

```mojo
from os import listdir
```

## Aliases

### `SEEK_CUR`

`alias SEEK_CUR = 1`

Seek from the current position.

### `SEEK_END`

`alias SEEK_END = 2`

Seek from the end of the file.

### `SEEK_SET`

`alias SEEK_SET = 0`

Seek from the beginning of the file.

### `sep`

`alias sep = "\\" if os_is_windows() else "/"`

## Functions

* [​`abort`](/mojo/stdlib/os/os/abort): Calls a target dependent trap instruction if available.
* [​`getuid`](/mojo/stdlib/os/os/getuid): Retrieve the user ID of the calling process.
* [​`listdir`](/mojo/stdlib/os/os/listdir): Gets the list of entries contained in the path provided.
* [​`makedirs`](/mojo/stdlib/os/os/makedirs): Creates a specified leaf directory along with any necessary intermediate directories that don't already exist.
* [​`mkdir`](/mojo/stdlib/os/os/mkdir): Creates a directory at the specified path.
* [​`remove`](/mojo/stdlib/os/os/remove): Removes the specified file.
* [​`removedirs`](/mojo/stdlib/os/os/removedirs): Removes a leaf directory and all empty intermediate ones.
* [​`rmdir`](/mojo/stdlib/os/os/rmdir): Removes the specified directory.
* [​`unlink`](/mojo/stdlib/os/os/unlink): Removes the specified file.

---

## os_is_linux

`os_is_linux() -> Bool`

Returns True if the host operating system is Linux.

**Returns:** True if the host operating system is Linux and False otherwise.

---

## os_is_macos

`os_is_macos() -> Bool`

Returns True if the host operating system is macOS.

**Returns:** True if the host operating system is macOS and False otherwise.

---

## os_is_windows

`os_is_windows() -> Bool`

Returns True if the host operating system is Windows.

**Returns:** True if the host operating system is Windows and False otherwise.

---

## outer_product_acc

`outer_product_acc(res: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], lhs: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], rhs: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Updates result tensor with the outer product of two vectors.

Computes `res += outer(lhs, rhs)` where `lhs` and `rhs` are vectors and `res` is a matrix.

**Constraints:** All tensors must have statically known shapes. `res` must be rank 2. `lhs` and `rhs` must be rank 1. `res.shape[0]` `==` `lhs.shape[0]` and `res.shape[1]` `==` `rhs.shape[0]`.

**Args:**

* ​res (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The result matrix to accumulate into, shape (M, N).
* ​lhs (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The left-hand side vector, shape (M,).
* ​rhs (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The right-hand side vector, shape (N,).

---

## owned_pointer

Implements `OwnedPointer`, a safe, single-ownership smart pointer.

You can import these APIs from the `memory` package. For example:

```mojo
from memory import OwnedPointer
```

## Structs

* [​`OwnedPointer`](/mojo/stdlib/memory/owned_pointer/OwnedPointer): A safe, owning, smart pointer.

---

## OwnedKwargsDict

`struct OwnedKwargsDict[V: Copyable & Movable]`

Container used to pass owned variadic keyword arguments to functions.

This type mimics the interface of a dictionary with `String` keys, and should be usable more-or-less like a dictionary. Notably, however, this type should not be instantiated directly by users.
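Since you don't instantiate it yourself, here is a minimal sketch of where this type shows up: a `**kwargs` argument arrives in the callee as an `OwnedKwargsDict` keyed by `String`:

```mojo
fn print_kwargs(**kwargs: Int):
    # `kwargs` here is an OwnedKwargsDict[Int] with String keys
    for entry in kwargs.items():
        print(entry[].key, entry[].value)

fn main():
    print_kwargs(a=1, b=2)
```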
## Parameters

* ​V (`Copyable & Movable`): The value type of the dictionary. Currently must be Copyable & Movable.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Aliases

### `key_type`

`alias key_type = String`

## Methods

### `__init__`

`__init__(out self)`

Initialize an empty keyword dictionary.

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Copy an existing keyword dictionary.

**Args:**

* ​existing (`Self`): The existing keyword dictionary.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Move data of an existing keyword dictionary into a new one.

**Args:**

* ​existing (`Self`): The existing keyword dictionary.

### `__getitem__`

`__getitem__(self, key: String) -> V`

Retrieve a value out of the keyword dictionary.

**Args:**

* ​key (`String`): The key to retrieve.

**Returns:** The value associated with the key, if it's present.

**Raises:** "KeyError" if the key isn't present.

### `__setitem__`

`__setitem__(mut self, key: String, value: V)`

Set a value in the keyword dictionary by key.

**Args:**

* ​key (`String`): The key to associate with the specified value.
* ​value (`V`): The data to store in the dictionary.

### `__contains__`

`__contains__(self, key: String) -> Bool`

Check if a given key is in the keyword dictionary or not.

**Args:**

* ​key (`String`): The key to check.

**Returns:** True if the key exists in the keyword dictionary, False otherwise.

### `copy`

`copy(self) -> Self`

Copy an existing keyword dictionary.

**Returns:** A copy of the value.

### `__len__`

`__len__(self) -> Int`

The number of elements currently stored in the keyword dictionary.

**Returns:** The number of elements currently stored in the keyword dictionary.

### `find`

`find(self, key: String) -> Optional[V]`

Find a value in the keyword dictionary by key.

**Args:**

* ​key (`String`): The key to search for in the dictionary.

**Returns:** An optional value containing a copy of the value if it was present, otherwise an empty Optional.

### `pop`

`pop(mut self, key: String, owned default: V) -> V`

Remove a value from the dictionary by key.

**Args:**

* ​key (`String`): The key to remove from the dictionary.
* ​default (`V`): A default value to return if the key was not found instead of raising.

**Returns:** The value associated with the key, if it was in the dictionary. If it wasn't, return the provided default value instead.

`pop(mut self, key: String) -> V`

Remove a value from the dictionary by key.

**Args:**

* ​key (`String`): The key to remove from the dictionary.

**Returns:** The value associated with the key, if it was in the dictionary. Raises otherwise.

**Raises:** "KeyError" if the key was not present in the dictionary.

### `__iter__`

`__iter__(ref self) -> _DictKeyIter[String, V, self_is_origin._dict]`

Iterate over the keyword dict's keys as immutable references.

**Returns:** An iterator of immutable references to the dictionary keys.

### `keys`

`keys(ref self) -> _DictKeyIter[String, V, self_is_origin._dict]`

Iterate over the keyword dict's keys as immutable references.

**Returns:** An iterator of immutable references to the dictionary keys.

### `values`

`values(ref self) -> _DictValueIter[String, V, self_is_origin._dict]`

Iterate over the keyword dict's values as references.

**Returns:** An iterator of references to the dictionary values.

### `items`

`items(ref self) -> _DictEntryIter[String, V, self_is_origin._dict]`

Iterate over the keyword dictionary's entries as immutable references.
Examples:

```mojo
var my_dict = Dict[String, Int]()
my_dict["a"] = 1
my_dict["b"] = 2
for e in my_dict.items():
    print(e[].key, e[].value)
```

Notes: These can't yet be unpacked like Python dict items, but you can access the key and value as attributes.

**Returns:** An iterator of immutable references to the dictionary entries.

---

## OwnedPointer

`@register_passable`

`struct OwnedPointer[T: AnyType]`

A safe, owning, smart pointer.

This smart pointer is designed for cases where there is clear ownership of the underlying data, and restricts access to it through the origin system such that no more than one mutable alias for the underlying data may exist.

For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual.

## Parameters

* ​T (`AnyType`): The type to be stored in the `OwnedPointer`.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__[T: Movable](owned value: T) -> OwnedPointer[T]`

Construct a new `OwnedPointer` by moving the passed value into a new backing allocation.

**Parameters:**

* ​T (`Movable`): The type of the data to store. It is restricted to `Movable` here to allow efficient move construction.

**Args:**

* ​value (`T`): The value to move into the `OwnedPointer`.

`__init__[T: ExplicitlyCopyable](*, copy_value: T) -> OwnedPointer[T]`

Construct a new `OwnedPointer` by explicitly copying the passed value into a new backing allocation.

**Parameters:**

* ​T (`ExplicitlyCopyable`): The type of the data to store, which must be `ExplicitlyCopyable`.

**Args:**

* ​copy\_value (`T`): The value to explicitly copy into the `OwnedPointer`.

`__init__[T: Copyable, U: NoneType = NoneType(None)](value: T) -> OwnedPointer[T]`

Construct a new `OwnedPointer` by copying the passed value into a new backing allocation.

**Parameters:**

* ​T (`Copyable`): The type of the data to store.
* ​U (`NoneType`): A dummy type parameter, to lower the selection priority of this ctor.

**Args:**

* ​value (`T`): The value to copy into the `OwnedPointer`.

`__init__[T: ExplicitlyCopyable](*, other: OwnedPointer[T]) -> OwnedPointer[T]`

Construct a new `OwnedPointer` by explicitly copying the value from another `OwnedPointer`.

**Parameters:**

* ​T (`ExplicitlyCopyable`): The type of the data to store.

**Args:**

* ​other (`OwnedPointer[T]`): The `OwnedPointer` to copy.

### `__del__`

`__del__(owned self)`

Destroy the `OwnedPointer`.

### `__getitem__`

`__getitem__(ref self) -> ref [self] T`

Returns a reference to the pointer's underlying data with parametric mutability.

**Returns:** A reference to the data underlying the `OwnedPointer`.

### `unsafe_ptr`

`unsafe_ptr(self) -> UnsafePointer[T]`

UNSAFE: returns the backing pointer for this `OwnedPointer`.

**Returns:** An UnsafePointer to the backing allocation for this `OwnedPointer`.

### `take`

`take[T: Movable](owned self: OwnedPointer[T]) -> T`

Move the value within the `OwnedPointer` out of it, consuming the `OwnedPointer` in the process.

**Parameters:**

* ​T (`Movable`): The type of the data backing this `OwnedPointer`. `take()` only exists for `T: Movable` since this consuming operation only makes sense for types that you want to avoid copying. For types that are `Copyable` or `ExplicitlyCopyable` but are not `Movable`, you can copy them through `__getitem__` as in `var v = some_ptr_var[]`.

**Returns:** The data that is (was) backing the `OwnedPointer`.
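A minimal sketch of the ownership flow (construct, dereference, then consume with `take()`):

```mojo
from memory import OwnedPointer

fn main():
    var p = OwnedPointer(String("hello"))  # moves the String into a new backing allocation
    print(p[])                             # __getitem__ dereferences the pointer
    var s = p^.take()                      # consumes the OwnedPointer, moving the value out
    print(s)                               # the String is now owned by `s`
```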
### `steal_data`

`steal_data(owned self) -> UnsafePointer[T]`

Take ownership over the heap allocated pointer backing this `OwnedPointer`.

**Safety:** This function is not unsafe to call, as a memory leak is not considered unsafe. However, to avoid a memory leak, callers should ensure that the returned pointer is eventually deinitialized and deallocated. Failure to do so will leak memory.

**Returns:** The pointer owned by this instance.

---

## Ownership

A challenge you might face when using some programming languages is that you must manually allocate and deallocate memory. When multiple parts of the program need access to the same memory, it becomes difficult to keep track of who "owns" a value and determine when is the right time to deallocate it. If you make a mistake, it can result in a "use-after-free" error, a "double free" error, or a "leaked memory" error, any one of which can be catastrophic.

Mojo helps avoid these errors by ensuring there is only one variable that owns each value at a time, while still allowing you to share references with other functions. When the owner's lifetime ends, Mojo [destroys the value](/mojo/manual/lifecycle/death). Programmers are still responsible for making sure any type that allocates resources (including memory) also deallocates those resources in its destructor. Mojo's ownership system ensures that destructors are called promptly.

On this page, we'll explain the rules that govern this ownership model, and how to specify different argument conventions that define how values are passed into functions.

## Ownership summary

The fundamental rules that make Mojo's ownership model work are the following:

* Every value has only one owner at a time.
* When the lifetime of the owner ends, Mojo destroys the value.
* If there are existing references to a value, Mojo extends the lifetime of the owner.

### Variables and references

A variable *owns* its value. A struct owns its fields.

A *reference* allows you to access a value owned by another variable. A reference can have either mutable access or immutable access to that value.

Mojo references are created when you call a function: function arguments can be passed as mutable or immutable references. A function can also return a reference instead of returning a value.

## Argument conventions

In all programming languages, code quality and performance are heavily dependent upon how functions treat argument values. That is, whether a value received by a function is a unique value or a reference, and whether it's mutable or immutable, has a series of consequences that define the readability, performance, and safety of the language.

In Mojo, we want to provide full [value semantics](/mojo/manual/values/value-semantics) by default, which provides consistent and predictable behavior. But as a systems programming language, we also need to offer full control over memory optimizations, which generally requires reference semantics. The trick is to introduce reference semantics in a way that ensures all code is memory safe by tracking the lifetime of every value and destroying each one at the right time (and only once). All of this is made possible in Mojo through the use of argument conventions that ensure every value has only one owner at a time.

An argument convention specifies whether an argument is mutable or immutable, and whether the function owns the value. Each convention is defined by a keyword at the beginning of an argument declaration:

* `read`: The function receives an **immutable reference**.
This means the function can read the original value (it is *not* a copy), but it cannot mutate (modify) it. Functions defined with `def` treat this differently, as described below in [Borrowed arguments](#borrowed-arguments-read).

* `mut`: The function receives a **mutable reference**. This means the function can read and mutate the original value (it is *not* a copy).
* `owned`: The function takes **ownership** of a value. This means the function has exclusive ownership of the argument. The caller might choose to transfer ownership of an existing value to this function, but that's not always what happens. The callee might receive a newly-created value, or a copy of an existing value.
* `ref`: The function gets a reference with parametric mutability: that is, the reference can be either mutable or immutable. You can think of `ref` arguments as a generalization of the `read` and `mut` conventions. `ref` arguments are an advanced topic, and they're described in more detail in [Lifetimes, origins, and references](/mojo/manual/values/lifetimes).
* `out`: A special convention used for the `self` argument in [constructors](/mojo/manual/lifecycle/life#constructor) and for [named results](/mojo/manual/functions#named-results). An `out` argument is uninitialized at the beginning of the function, and must be initialized before the function returns. Although `out` arguments show up in the argument list, they're never passed in by the caller.

For example, this function has one argument that's a mutable reference and one that's immutable:

```mojo
fn add(mut x: Int, read y: Int):
    x += y

fn main():
    var a = 1
    var b = 2
    add(a, b)
    print(a)
```

```output
3
```

You've probably already seen some function arguments that don't declare a convention. By default, all arguments are `read`. In the following sections, we'll explain each of these argument conventions in more detail.

## Borrowed arguments (`read`)

The `read` convention is the default for all arguments. But as is described in [`def` and `fn` comparison](/mojo/manual/functions#def-and-fn-comparison), functions treat `read` arguments somewhat differently depending on whether the function is defined with `def` or `fn`:

* When using `def`, if you mutate the value in the body of the function, the function receives a mutable copy of the argument. Otherwise, it receives an immutable reference. This allows you to treat arguments as mutable, but avoid the overhead of making extra copies when they're not needed.
* When using `fn`, the function always receives an immutable reference. If you want a mutable copy, you can assign it to a local variable:

```mojo
var my_copy = read_arg
```

In both cases, the original value on the caller side can't be changed by the callee. For example:

```mojo
def print_list(list: List[Int]):
    print(list.__str__())

def main():
    var values = List(1, 2, 3, 4)
    print_list(values)
```

```output
[1, 2, 3, 4]
```

Here the `list` argument to `print_list()` is read and not mutated, so the `print_list()` function gets an immutable reference to the original `List`, and doesn't do any copying.

In general, passing an immutable reference is much more efficient when handling large or expensive-to-copy values, because the copy constructor and destructor are not invoked for a `read` argument.

### Compared to C++ and Rust

Mojo's read argument convention is similar in some ways to passing an argument by `const&` in C++, which also avoids a copy of the value and disables mutability in the callee.
However, the read convention differs from `const&` in C++ in two important ways:

* The Mojo compiler implements a lifetime checker that ensures that values are not destroyed when there are outstanding references to those values.
* Small values like `Int`, `Float`, and `SIMD` are passed directly in machine registers instead of through an extra indirection (this is because they are declared with the `@register_passable` decorator). This is a [significant performance enhancement](https://www.forrestthewoods.com/blog/should-small-rust-structs-be-passed-by-copy-or-by-borrow/) when compared to languages like C++ and Rust, and moves this optimization from every call site to a declaration on the type definition.

The major difference between Rust and Mojo is that Mojo does not require a sigil on the caller side to pass by immutable reference. Also, Mojo is more efficient when passing small values, and Rust defaults to moving values instead of passing them around by borrow. These policy and syntax decisions allow Mojo to provide an easier-to-use programming model.

## Mutable arguments (`mut`)

If you'd like your function to receive a **mutable reference**, add the `mut` keyword in front of the argument name. You can think of `mut` like this: it means any changes to the value *in*side the function are visible *out*side the function.

For example, this `mutate()` function updates the original `list` value:

```mojo
def print_list(list: List[Int]):
    print(list.__str__())

def mutate(mut l: List[Int]):
    l.append(5)

def main():
    var values = List(1, 2, 3, 4)
    mutate(values)
    print_list(values)
```

```output
[1, 2, 3, 4, 5]
```

That behaves like an optimized replacement for this:

```mojo
def print_list(list: List[Int]):
    print(list.__str__())

def mutate_copy(l: List[Int]) -> List[Int]:
    # def creates an implicit copy of the list because it's mutated
    l.append(5)
    return l

def main():
    var values = List(1, 2, 3, 4)
    values = mutate_copy(values)
    print_list(values)
```

```output
[1, 2, 3, 4, 5]
```

Although the code using `mut` isn't that much shorter, it's more memory efficient because it does not make a copy of the value.

However, remember that the values passed as `mut` must already be mutable. For example, if you try to take a `read` value and pass it to another function as `mut`, you'll get a compiler error because Mojo can't form a mutable reference from an immutable reference.

:::note You cannot define [default values](/mojo/manual/functions#optional-arguments) for `mut` arguments. :::

### Argument exclusivity

Mojo enforces *argument exclusivity* for mutable references. This means that if a function receives a mutable reference to a value (such as an `mut` argument), it can't receive any other references to the same value—mutable or immutable. That is, a mutable reference can't have any other references that *alias* it.

For example, consider the following code example:

```mojo
fn append_twice(mut s: String, other: String):
    # Mojo knows 's' and 'other' cannot be the same string.
    s += other
    s += other

fn invalid_access():
    var my_string = String("o")  # Create a run-time String value

    # error: passing `my_string` mut is invalid since it is also passed
    # read.
    append_twice(my_string, my_string)
    print(my_string)
```

This code is confusing because the user might expect the output to be `ooo`, but since the first addition mutates both `s` and `other`, the actual output would be `oooo`. Enforcing exclusivity of mutable references not only prevents coding errors, it also allows the Mojo compiler to optimize code in some cases.
One way to avoid this issue when you do need both a mutable and an immutable reference (or need to pass the same value to two arguments) is to make a copy:

```mojo
fn valid_access():
    var my_string = String("o")  # Create a run-time String value
    var other_string = String(my_string)  # Create a copy of the String value
    append_twice(my_string, other_string)
    print(my_string)
```

Note that argument exclusivity isn't enforced for register-passable trivial types (like `Int` and `Bool`), because they are always passed by copy. When passing the same value into two `Int` arguments, the callee will receive two copies of the value.

## Transfer arguments (`owned` and `^`)

And finally, if you'd like your function to receive value **ownership**, add the `owned` keyword in front of the argument name. This convention is often combined with use of the postfixed `^` "transfer" sigil on the variable that is passed into the function, which ends the lifetime of that variable.

Technically, the `owned` keyword does not guarantee that the received value is *the original value*—it guarantees only that the function gets unique ownership of a value. This happens in one of three ways:

* The caller passes the argument with the `^` transfer sigil, which ends the lifetime of that variable (the variable becomes uninitialized) and ownership is transferred into the function.
* The caller **does not** use the `^` transfer sigil, in which case, Mojo copies the value. If the type isn't copyable, this is a compile-time error.
* The caller passes in a newly-created "owned" value, such as a value returned from a function. In this case, no variable owns the value and it can be transferred directly to the callee.

For example:

```mojo
def take(owned s: String):
    pass

def main():
    take(String("A brand-new String!"))
```

The following code works by making a copy of the string, because `take_text()` uses the `owned` convention, and the caller does not include the transfer sigil:

```mojo
fn take_text(owned text: String):
    text += "!"
    print(text)

fn main():
    var message = String("Hello")  # Create a run-time String value
    take_text(message)
    print(message)
```

```output
Hello!
Hello
```

However, if you add the `^` transfer sigil when calling `take_text()`, the compiler complains about `print(message)`, because at that point, the `message` variable is no longer initialized. That is, this version does not compile:

```mojo
fn main():
    var message = String("Hello")  # Create a run-time String value
    take_text(message^)
    print(message)  # error: use of uninitialized value 'message'
```

This is a critical feature of Mojo's lifetime checker, because it ensures that no two variables can have ownership of the same value. To fix the error, you must not use the `message` variable after you end its lifetime with the `^` transfer operator. So here is the corrected code:

```mojo
fn take_text(owned text: String):
    text += "!"
    print(text)

fn main():
    var message = String("Hello")  # Create a run-time String value
    take_text(message^)
```

```output
Hello!
```

Regardless of how it receives the value, when the function declares an argument as `owned`, it can be certain that it has unique mutable access to that value. Because the value is owned, the value is destroyed when the function exits—unless the function transfers the value elsewhere.

For example, in the following example, `add_to_list()` takes a string and appends it to the list. Ownership of the string is transferred to the list, so it's not destroyed when the function exits.
On the other hand, `consume_string()` doesn't transfer its `owned` value out, so the value is destroyed at the end of the function.

```mojo
def add_to_list(owned name: String, mut list: List[String]):
    list.append(name^)
    # name is uninitialized, nothing to destroy

def consume_string(owned s: String):
    print(s)
    # s is destroyed here
```

### Transfer implementation details

In Mojo, you shouldn't conflate "ownership transfer" with a "move operation"—these are not strictly the same thing.

There are multiple ways that Mojo can transfer ownership of a value:

* If a type implements the [move constructor](/mojo/manual/lifecycle/life#move-constructor), `__moveinit__()`, Mojo may invoke this method *if* a value of that type is transferred into a function as an `owned` argument, *and* the original variable's lifetime ends at the same point (with or without use of the `^` transfer sigil).
* If a type implements the [copy constructor](/mojo/manual/lifecycle/life#copy-constructor), `__copyinit__()` and not `__moveinit__()`, Mojo may copy the value and destroy the old value.
* In some cases, Mojo can optimize away the move operation entirely, leaving the value in the same memory location but updating its ownership. In these cases, a value can be transferred without invoking either the `__copyinit__()` or `__moveinit__()` constructors.

In order for the `owned` convention to work *without* the transfer sigil, the value type must be copyable (via `__copyinit__()`).

## Comparing `def` and `fn` argument conventions

As mentioned in [Functions](/mojo/manual/functions), a function defined with `def` can treat a `read` argument as mutable, in which case it receives a mutable copy. An equivalent function defined with `fn` would need to make this copy explicit. For example, these two functions have the exact same behavior.

```mojo
def def_example(a: Int, mut b: Int):
    pass

fn fn_example(a_in: Int, mut b: Int):
    var a = a_in
    pass
```

This shadow copy typically adds little overhead for small types. However, copying large types that allocate heap storage can be expensive. (For example, copying `List` or `Dict` types, or copying large numbers of strings.)

### `read` versus `owned` in `def` functions

The difference between `read` and `owned` in a `def` function may be a little subtle. In both cases, you can end up with a uniquely-owned value that's a copy of the original value.

* The `read` argument always gets an immutable reference or a local copy. You can't transfer a value into a `read` argument.
* The `owned` argument always gets a uniquely owned value, which may have been copied or transferred from the caller. Using `owned` arguments without the transfer sigil (`^`) usually results in values being copied.

---

## pack_b

`pack_b[transpose_b: Bool, simd_size: Int, inner_size: Int, a_type: DType, b_type: DType, c_type: DType, src_shape: DimList, dst_shape: DimList](dst: NDBuffer[b_type, 2, origin, dst_shape], src: NDBuffer[b_type, 2, origin, src_shape], tile_n: Int, tile_k: Int)`

Utility function to pack the entire B matrix, such that each \[tile\_n // inner\_size, tile\_k, inner\_size] tile of src is contiguous in dst.

Tiles (not tile contents) are stored in row major order, so tile\[i, j] is tile\_n \* tile\_k bytes away from tile\[i, j+1].
---

## pack_b_ndbuffer

`pack_b_ndbuffer[b_mut: Bool, //, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList, b_origin: Origin[b_mut], output_origin: MutableOrigin](b_input: NDBuffer[b_type, 2, b_origin, b_shape], output_buffer: NDBuffer[b_type, 2, output_origin])`

---

## pack_bits

`pack_bits[width: Int, //, new_type: DType = uint8 if (width == 8) else uint16 if (width == 16) else uint32 if (width == 32) else uint64 if (width == 64) else ui128 if (width == 128) else ui256 if (width == 256) else invalid](val: SIMD[bool, width]) -> SIMD[new_type, 1]`

Packs a SIMD vector of `bool` values into an integer.

Examples:

This example packs a vector of 8 `bool` values into a single 8-bit integer.

```mojo
from memory import pack_bits

flags = SIMD[DType.bool, 8](1, 1, 0, 1, 0, 0, 0, 0)
i = pack_bits[DType.uint8](flags)
print(flags, i)  # [True, True, False, True, False, False, False, False] 11
```

**Constraints:** The width of the bool vector must be the same as the bitwidth of the target type.

**Parameters:**

* ​width (`Int`): The source width.
* ​new\_type (`DType`): The target type.

**Args:**

* ​val (`SIMD[bool, width]`): The source value.

**Returns:** A new integer scalar which has the same bitwidth as the bool vector.

---

## pack_conv_filter_shape

`pack_conv_filter_shape[single_thread_blocking_override: Bool](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int) -> IndexList[(rank + 1)]`

Compute the output shape of convolution filter packing.

**Parameters:**

* ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.

**Args:**

* ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The filter to be packed.
* ​num\_groups (`Int`): The number of groups in the convolution.

**Returns:** The output shape.

---

## pack_filter

`pack_filter(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)`

This packs the filter from RSCF to FRSCf. Use the default micro kernel size for dynamic shapes.

`pack_filter[simd_size: Int, micro_kernel_f_size: Int](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)`

This packs the filter from RSCF to FRSCf. F is first broken down into segments of size micro\_kernel\_f\_size, then the remainder is further divided by simd\_size. Any residual elements are padded with zeros to fill simd\_size.

**Parameters:**

* ​simd\_size (`Int`): Can differ from the simd size of the input type.
* ​micro\_kernel\_f\_size (`Int`): The size of the last dimension in FRSCf, which equals the size of the micro kernel's F dimension.

**Args:**

* ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): Filter in RSCF layout (if 2D).
* ​packed\_filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): Packed filter in FRSCf layout (if 2D).
F - the index of continuous segments in micro kernel. R, S, C - original R, S, C. f - the index within a continuous segment.

* ​num\_groups (`Int`): The number of groups in the convolution.

---

## pack_filter

`pack_filter(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], packed_filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int)`

This packs the filter from RSFC to FRSCf.

---

## pack_filter_shape

`pack_filter_shape[filter_type: DType, input_shape: DimList, filter_shape: DimList, output_shape: DimList, strides: DimList, dilations: DimList, paddings: DimList, num_groups: Int, single_thread_blocking_override: Bool](filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> IndexList[(rank + 1)]`

Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel.

**Returns:** The output shape.

---

## pack_filter_shape

`pack_filter_shape(filter: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], num_groups: Int) -> IndexList[(rank + 1)]`

Compute the output shape of transposed convolution filter packing.

**Args:**

* ​filter (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The filter to be packed.
* ​num\_groups (`Int`): The number of groups in the convolution.

**Returns:** The output shape.

---

## pack_filter_shape_impl

`pack_filter_shape_impl[filter_type: DType](Q: Int, R: Int, S: Int, C: Int, F: Int, num_groups: Int) -> IndexList[6]`

Compute the shape of packed filter. The packed layout is FRSCf. shape\_ref should be allocated with size 5 outside this kernel.

**Args:**

* ​Q (`Int`): Original Q filter dimension.
* ​R (`Int`): Original R filter dimension.
* ​S (`Int`): Original S filter dimension.
* ​C (`Int`): Original C filter dimension.
* ​F (`Int`): Original F filter dimension.
* ​num\_groups (`Int`): Number of groups in the convolution.

**Returns:** The output shape.

---

## pack_matmul_b_shape_func

`pack_matmul_b_shape_func[a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList, transpose_in_0: Bool, single_thread_blocking_override: Bool](b_input: NDBuffer[b_type, 2, origin, b_shape]) -> IndexList[2]`

---

## pack_Q_tile

`pack_Q_tile(input: SIMD[uint8, 16]) -> SIMD[uint32, 4]`

---

## pack_transposed_b_ndbuffer

`pack_transposed_b_ndbuffer[a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, c_type: DType, c_shape: DimList](b_input: NDBuffer[b_type, 2, origin, b_shape], output_buffer: NDBuffer[b_type, 2, origin])`

---

## packA_i8mm

`packA_i8mm[a_type: DType](t0: Int, t1: Int, k: Int, a_ptr: UnsafePointer[SIMD[a_type, 1]], a_packed_ptr: UnsafePointer[SIMD[a_type, 1]])`

---

## packing

## Structs

* [​`BTileGenerator`](./BTileGenerator): Struct to encapsulate a tile of B that supports prepacking.
* [​`PackMatrixCols`](./PackMatrixCols): Pack columns from a matrix into the mlas packed layout and extract inner vectors of columns into the packed inner dimension, e.g. extracts \[X, Y] and packs as \[Yo]\[X]\[Yi].
* [​`PackMatrixRows`](./PackMatrixRows): Pack rows from a matrix into the mlas packed layout and extract inner vectors of rows into the packed inner dimension, e.g.
extract tile \[X, Y] and pack into \[Xo]\[Y]\[Xi].

## Functions

* [​`pack_b`](./pack_b): Utility function to pack the entire B matrix, such that each \[tile\_n // inner\_size, tile\_k, inner\_size] tile of src is contiguous in dst.
* [​`pack_b_ndbuffer`](./pack_b_ndbuffer):
* [​`pack_matmul_b_shape_func`](./pack_matmul_b_shape_func):
* [​`pack_transposed_b_ndbuffer`](./pack_transposed_b_ndbuffer):

---

## PackMatrixCols

`struct PackMatrixCols[original_mut: Bool, //, original_shape: DimList, packed_shape: DimList, type: DType, simd_size: Int, column_inner_size: Int, use_vnni: Bool, use_i8mm: Bool, packed_origin: MutableOrigin, original_origin: Origin[original_mut]]`

Pack columns from a matrix into the mlas packed layout and extract inner vectors of columns into the packed inner dimension, e.g. extracts \[X, Y] and packs as \[Yo]\[X]\[Yi].

## Fields

* ​packed\_matrix (`NDBuffer[type, 3, packed_origin, packed_shape]`):
* ​original\_matrix (`NDBuffer[type, 2, original_origin, original_shape]`):
* ​global\_offset (`IndexList[2]`):
* ​pack\_tile\_dim (`IndexList[2]`):
* ​valid\_data\_dim (`IndexList[2]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `run`

`static run(packed_matrix: NDBuffer[type, 3, MutableAnyOrigin, packed_shape], original_matrix: NDBuffer[type, 2, MutableAnyOrigin, original_shape], global_offset: IndexList[2], pack_tile_dim: IndexList[2], valid_data_dim: IndexList[2])`

Interface function to run the packing routine.

**Args:**

* ​packed\_matrix (`NDBuffer`): Pre-allocated buffer space for packed data.
* ​original\_matrix (`NDBuffer`): Data buffer containing the original matrix to pack.
* ​global\_offset (`IndexList`): Offset to use when indexing the original matrix.
* ​pack\_tile\_dim (`IndexList`): 2D dimension tuple describing the size of the packed tile.
* ​valid\_data\_dim (`IndexList`): 2D dimension tuple describing the amount of valid data on the global buffer starting from the offset.

---

## PackMatrixRows

`struct PackMatrixRows[original_mut: Bool, //, original_shape: DimList, packed_shape: DimList, type: DType, simd_size: Int, row_inner_size: Int, packed_origin: MutableOrigin, original_origin: Origin[original_mut]]`

Pack rows from a matrix into the mlas packed layout and extract inner vectors of rows into the packed inner dimension, e.g. extract tile \[X, Y] and pack into \[Xo]\[Y]\[Xi].

## Fields

* ​packed\_matrix (`NDBuffer[type, 3, packed_origin, packed_shape]`):
* ​original\_matrix (`NDBuffer[type, 2, original_origin, original_shape]`):
* ​global\_offset (`IndexList[2]`):
* ​pack\_tile\_dim (`IndexList[2]`):
* ​valid\_data\_dim (`IndexList[2]`):
* ​valid\_simd\_dim (`IndexList[2]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `run`

`static run(packed_matrix: NDBuffer[type, 3, packed_origin, packed_shape], original_matrix: NDBuffer[type, 2, original_origin, original_shape], global_offset: IndexList[2], pack_tile_dim: IndexList[2], valid_data_dim: IndexList[2])`

Interface function to run the packing routine.

**Args:**

* ​packed\_matrix (`NDBuffer`): Pre-allocated buffer space for packed data.
* ​original\_matrix (`NDBuffer`): Data buffer containing the original matrix to pack.
* ​global\_offset (`IndexList`): Offset to use when indexing the original matrix.
* ​pack\_tile\_dim (`IndexList`): 2D dimension tuple describing the size of the packed tile.
* ​valid\_data\_dim (`IndexList`): 2D dimension tuple describing the amount of valid data on the global buffer starting from the offset.
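To make the packed layouts concrete, here's a minimal sketch of the index mapping described above. This is only an illustration of the documented \[Yo]\[X]\[Yi] layout, not the library's implementation; the tile sizes and inner size are made-up values:

```mojo
fn main():
    # Illustrative sizes; in the real kernel, the inner size comes from
    # the column_inner_size parameter.
    alias X = 2
    alias Y = 8
    alias inner = 4
    for x in range(X):
        for y in range(Y):
            # y splits into an outer index (yo) and an inner index (yi),
            # so element (x, y) of the [X, Y] tile lands at packed[yo][x][yi].
            var yo = y // inner
            var yi = y % inner
            var flat = (yo * X + x) * inner + yi
            print("(", x, ",", y, ") -> [", yo, "][", x, "][", yi, "], flat index", flat)
```

The row-packing layout \[Xo]\[Y]\[Xi] is analogous, with the roles of the row and column indices swapped.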
---

## pad

## Functions

* [​`pad_constant`](./pad_constant): Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`.
* [​`pad_reflect`](./pad_reflect): Fill `output` with values from `input`, and edges padded with reflected values from the unpadded region.
* [​`pad_repeat`](./pad_repeat): Fill `output` with values from `input`, and edges padded with boundary values from the unpadded region.
* [​`pad_shape`](./pad_shape): Compute the output shape of a `pad` operation, and assert the inputs are compatible.

---

## pad_constant

`pad_constant[rank: Int, output_shape: DimList, input_shape: DimList, type: DType, paddings_type: DType, constant_type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], paddings: UnsafePointer[SIMD[paddings_type, 1]], constant: SIMD[constant_type, 1])`

Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`.

Example:

```mojo
var input_shape = (X, Y, Z)
var paddings = [x0, x1, y0, y1, z0, z1]

out[x, y, z] = input[x - x0, y - y0, z - z0] if x ∈ [x0, x0 + X] && y ∈ [y0, y0 + Y] && z ∈ [z0, z0 + Z] else constant
```

**Args:**

* ​output (`NDBuffer[type, rank, origin, output_shape]`): The output buffer.
* ​input (`NDBuffer[type, rank, origin, input_shape]`): The input buffer.
* ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis.
* ​constant (`SIMD[constant_type, 1]`): The constant to pad output with.

---

## pad_constant

`pad_constant[rank: Int, type: DType, padding_type: DType](output: UnsafePointer[SIMD[type, 1]], output_shape: IndexList[rank], input: UnsafePointer[SIMD[type, 1]], input_shape: IndexList[rank], paddings: UnsafePointer[SIMD[padding_type, 1]], constant: SIMD[type, 1], ctx: DeviceContext)`

Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`.

Example:

```mojo
var input_shape = (X, Y, Z)
var paddings = [x0, x1, y0, y1, z0, z1]

out[x, y, z] = input[x - x0, y - y0, z - z0] if x ∈ [x0, x0 + X] && y ∈ [y0, y0 + Y] && z ∈ [z0, z0 + Z] else constant
```

**Args:**

* ​output (`UnsafePointer[SIMD[type, 1]]`): The output buffer.
* ​output\_shape (`IndexList[rank]`): The output shape.
* ​input (`UnsafePointer[SIMD[type, 1]]`): The input buffer.
* ​input\_shape (`IndexList[rank]`): The input shape.
* ​paddings (`UnsafePointer[SIMD[padding_type, 1]]`): Ordered (before, after) padding sizes for each axis.
* ​constant (`SIMD[type, 1]`): The constant to pad output with.
* ​ctx (`DeviceContext`): Device context for participating GPU.

---

## pad_gpu

## Functions

* [​`get_padding_output_shape`](./get_padding_output_shape):
* [​`pad_constant`](./pad_constant): Fill `output` with values from `input`, and edges padded with `constant` based on `paddings`.

---

## pad_reflect

`pad_reflect[rank: Int, output_shape: DimList, input_shape: DimList, type: DType, paddings_type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], paddings: UnsafePointer[SIMD[paddings_type, 1]])`

Fill `output` with values from `input`, and edges padded with reflected values from the unpadded region.

Example:

```mojo
var input = [[1, 2], [3, 4]]
var paddings = [2, 2, 1, 0]
```

Yields:

```mojo
output = [[2, 1, 2], [4, 3, 4], [2, 1, 2], [4, 3, 4], [2, 1, 2], [4, 3, 4]]
```

**Args:**

* ​output (`NDBuffer[type, rank, origin, output_shape]`): The output buffer.
* ​input (`NDBuffer[type, rank, origin, input_shape]`): The input buffer.
* ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis.

---

## pad_repeat

`pad_repeat[rank: Int, output_shape: DimList, input_shape: DimList, type: DType, paddings_type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], paddings: UnsafePointer[SIMD[paddings_type, 1]])`

Fill `output` with values from `input`, and edges padded with boundary values from the unpadded region.

Example:

```mojo
var input = [[1, 2], [3, 4]]
var paddings = [2, 2, 1, 0]
```

Yields:

```mojo
output = [[1, 1, 2], [1, 1, 2], [1, 1, 2], [3, 3, 4], [3, 3, 4], [3, 3, 4]]
```

**Parameters:**

* ​rank (`Int`): Rank of the input/output buffers.
* ​output\_shape (`DimList`): Dimensions of the output buffer.
* ​input\_shape (`DimList`): Dimensions of the input buffer.
* ​type (`DType`): DType of the input/output buffer.
* ​paddings\_type (`DType`): DType of the paddings buffer.

**Args:**

* ​output (`NDBuffer[type, rank, origin, output_shape]`): The output buffer.
* ​input (`NDBuffer[type, rank, origin, input_shape]`): The input buffer.
* ​paddings (`UnsafePointer[SIMD[paddings_type, 1]]`): Ordered (before, after) padding sizes for each axis.

---

## pad_shape

`pad_shape[input_rank: Int, input_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], paddings_buf: NDBuffer[paddings_type, 1, origin]) -> IndexList[input_rank]`

Compute the output shape of a `pad` operation, and assert the inputs are compatible.

**Parameters:**

* ​input\_rank (`Int`): Rank of the input tensor.
* ​input\_type (`DType`): Type of the input tensor.
* ​paddings\_type (`DType`): Type of the padding tensor.
* ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.

**Args:**

* ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The tensor to pad.
* ​paddings\_buf (`NDBuffer[paddings_type, 1, origin]`): The paddings tensor, of shape (input\_rank, 2).

**Returns:** The output shape.

---

## Padding tokens

Padding tokens are extra tokens (usually zeros or special tokens) that are added to the input for a model so that the input matches the model's fixed input length or to ensure that all sequences in a [batch](batching.mdx) have the same length. In [transformer](transformer.mdx) models, padding tokens have been mostly replaced with [ragged tensors](ragged-tensors.mdx).

---

## PadHandling

`@register_passable(trivial)`

`struct PadHandling`

## Fields

* ​value (`Int`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `EXCLUDE_PAD`

`alias EXCLUDE_PAD = PadHandling(0)`

### `INCLUDE_PAD`

`alias INCLUDE_PAD = PadHandling(2)`

## Methods

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

---

## PagedAttention

PagedAttention is a memory management technique designed to improve GPU memory utilization during large language model (LLM) serving. Inspired by classical virtual memory and paging methods used in operating systems, PagedAttention divides the [KV cache](kv-cache.mdx) into fixed-size blocks, which are not necessarily stored contiguously in memory.
This approach enables more efficient handling of dynamic states in LLMs, allowing the model to manage large context sizes while optimizing memory usage, as described in the 2023 paper [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180) (Kwon, et al., 2023). Also written as "paged attention." --- ## PagedKVCache `@register_passable(trivial)` `struct PagedKVCache[type_: DType, kv_params_: KVCacheStaticParams, page_size: Int, assert_write_mode: Int = 0]` The PagedKVCache is a wrapper around the KVCache blocks for a given layer. It is used to access the KVCache blocks for PagedAttention. ## Fields * ​blocks (`NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 2, MutableAnyOrigin]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `KVCacheT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `kv_params` `alias kv_params = kv_params_` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(blocks: NDBuffer[type_, 4, MutableAnyOrigin, __init__[::Indexer,::Indexer,::Indexer,::Indexer](Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 2, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1]) -> Self` ### `max_tile_size` `static max_tile_size() -> Int` Returns the maximum tile size for the KVCache. ### `cache_lengths_nd` `cache_lengths_nd(self) -> NDBuffer[uint32, 1, MutableAnyOrigin]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` Returns the length of the cache for a given batch index. ### `load` `load[width: Int](self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int) -> SIMD[type_, width]` Loads an element from the given index. ### `store` `store(self, bs: Int, head_idx: Int, tok_idx: Int, head_dim_idx: Int, val: SIMD[type_, size])` Stores an element at the given index. ### `empty_cache` `empty_cache(self) -> Bool` Returns true if the cache\_lengths for all requests is 0, false otherwise. ### `max_prompt_length` `max_prompt_length(self) -> SIMD[uint32, 1]` Returns the maximum sequence length across all batches of the current request. ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` Returns the maximum cache length used across all batches of the current request. 
### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: Int, start_tok_idx: Int, head_idx: Int, head_dim_idx: Int = 0) -> UnsafePointer[SIMD[type_, 1]]` --- ## PagedKVCacheCollection `struct PagedKVCacheCollection[type_: DType, kv_params_: KVCacheStaticParams, page_size: Int, assert_write_mode: Int = 0]` ## Fields * ​blocks (`NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]`): * ​cache\_lengths (`NDBuffer[uint32, 1, MutableAnyOrigin]`): * ​lookup\_table (`NDBuffer[uint32, 2, MutableAnyOrigin]`): * ​max\_seq\_length (`SIMD[uint32, 1]`): * ​max\_cache\_length (`SIMD[uint32, 1]`): * ​kv\_cache\_dynamic\_shape (`IndexList[4]`): * ​kv\_cache\_dynamic\_strides (`IndexList[4]`): ## Implemented traits `AnyType`, `Copyable`, `KVCollectionT`, `Movable`, `UnknownDestructibility` ## Aliases ### `blocks_shape` `alias blocks_shape = DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size))` ### `blocks_stride` `alias blocks_stride = _strides_from_shape[::DimList,::Int]()` ### `blocks_type` `alias blocks_type = NDBuffer[type_, 6, MutableAnyOrigin, DimList(Dim(-31337), Dim(-31337), Dim(-31337), Dim(page_size), Dim(kv_params_.num_heads), Dim(kv_params_.head_size)), _strides_from_shape[::DimList,::Int]()]` ### `CacheType` `alias CacheType = PagedKVCache[type_, kv_params_, page_size, assert_write_mode]` ### `kv_params` `alias kv_params = kv_params_` ### `name_str` `alias name_str = "paged"` ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(out self, blocks: NDBuffer[type_, 6, MutableAnyOrigin], cache_lengths: NDBuffer[uint32, 1, MutableAnyOrigin], lookup_table: NDBuffer[uint32, 2, MutableAnyOrigin], max_seq_length: SIMD[uint32, 1], max_cache_length: SIMD[uint32, 1])` ### `__copyinit__` `__copyinit__(out self, other: Self)` ### `__moveinit__` `__moveinit__(out self, owned other: Self)` ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `get_key_cache` `get_key_cache(self, layer_idx: Int) -> PagedKVCache[type_, kv_params_, page_size, assert_write_mode]` ### `get_value_cache` `get_value_cache(self, layer_idx: Int) -> PagedKVCache[type_, kv_params_, page_size, assert_write_mode]` ### `cache_length` `cache_length(self, bs_idx: Int) -> Int` --- ## parallel_memcpy `parallel_memcpy[type: DType](dest: UnsafePointer[SIMD[type, 1]], src: UnsafePointer[SIMD[type, 1]], count: Int, count_per_task: Int, num_tasks: Int)` Copies `count` elements from a memory buffer `src` to `dest` in parallel by spawning `num_tasks` tasks each copying `count_per_task` elements. **Parameters:** * ​type (`DType`): The element dtype. **Args:** * ​dest (`UnsafePointer[SIMD[type, 1]]`): The destination buffer. * ​src (`UnsafePointer[SIMD[type, 1]]`): The source buffer. * ​count (`Int`): Number of elements in the buffer. * ​count\_per\_task (`Int`): Task size. * ​num\_tasks (`Int`): Number of tasks to run in parallel. `parallel_memcpy[type: DType](dest: UnsafePointer[SIMD[type, 1]], src: UnsafePointer[SIMD[type, 1]], count: Int)` Copies `count` elements from a memory buffer `src` to `dest` in parallel. **Parameters:** * ​type (`DType`): The element type. **Args:** * ​dest (`UnsafePointer[SIMD[type, 1]]`): The destination pointer. * ​src (`UnsafePointer[SIMD[type, 1]]`): The source pointer. * ​count (`Int`): The number of elements to copy. 
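As a usage sketch (not from the library docs; the import path for `parallel_memcpy` is assumed here to be the `algorithm` package), the simple overload copies a buffer and lets the runtime pick the task split:

```mojo
from algorithm import parallel_memcpy  # assumed import path
from memory import UnsafePointer

fn main():
    alias count = 1 << 20
    var src = UnsafePointer[Float32].alloc(count)
    var dest = UnsafePointer[Float32].alloc(count)
    for i in range(count):
        src[i] = Float32(i)
    # Copy all elements in parallel; the runtime chooses the task sizes.
    parallel_memcpy(dest, src, count)
    print(dest[count - 1])  # 1048575.0
    src.free()
    dest.free()
```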
---

## parallelism_level

`parallelism_level() -> Int`

Gets the parallelism level of the Runtime.

**Returns:** The number of worker threads available in the async runtime.

---

## parallelize

`parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int)`

Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete.

**Parameters:**

* ​origins (`origin.set`): The capture origins.
* ​func (`fn(Int) capturing -> None`): The function to invoke.

**Args:**

* ​num\_work\_items (`Int`): Number of parallel tasks.

`parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int, num_workers: Int)`

Executes func(0) ... func(num\_work\_items-1) as sub-tasks in parallel, and returns when all are complete.

**Parameters:**

* ​origins (`origin.set`): The capture origins.
* ​func (`fn(Int) capturing -> None`): The function to invoke.

**Args:**

* ​num\_work\_items (`Int`): Number of parallel tasks.
* ​num\_workers (`Int`): The number of workers to use for execution.

---

## parallelize_over_rows

`parallelize_over_rows[: origin.set, //, func: fn(Int, Int) capturing -> None](shape: IndexList[size, element_type=element_type], axis: Int, grain_size: Int)`

Parallelize func over non-axis dims of shape.

**Parameters:**

* ​func (`fn(Int, Int) capturing -> None`): Function to call on range of rows.

**Args:**

* ​shape (`IndexList[size, element_type=element_type]`): Shape to parallelize over.
* ​axis (`Int`): Rows are slices along the axis dimension of shape.
* ​grain\_size (`Int`): The minimum number of elements to warrant using an additional thread.

---

## param_env

Implements functions for retrieving compile-time defines.

You can use these functions to set parameter values or runtime constants based on name-value pairs defined on the command line. For example:

```mojo
from sys import is_defined

alias float_type = DType.float32 if is_defined["FLOAT32"]() else DType.float64

# Use `float_type` as a constant.
```

And on the command line:

```
mojo -D FLOAT32 main.mojo
```

For more information, see the [Mojo build docs](/mojo/cli/build.html#d-keyvalue). The `mojo run` command also supports the `-D` option.

You can import these APIs from the `sys` package. For example:

```mojo
from sys import is_defined
```

## Functions

* [​`env_get_bool`](/mojo/stdlib/sys/param_env/env_get_bool): Try to get a boolean-valued define. Compilation fails if the name is not defined or the value is neither `True` nor `False`.
* [​`env_get_dtype`](/mojo/stdlib/sys/param_env/env_get_dtype): Try to get a DType-valued define. If the name is not defined, return a default value instead.
* [​`env_get_int`](/mojo/stdlib/sys/param_env/env_get_int): Try to get an integer-valued define. Compilation fails if the name is not defined.
* [​`env_get_string`](/mojo/stdlib/sys/param_env/env_get_string): Try to get a string-valued define. Compilation fails if the name is not defined.
* [​`is_defined`](/mojo/stdlib/sys/param_env/is_defined): Return true if the named value is defined.

---

## Parameterization: compile-time metaprogramming

Many languages have facilities for *metaprogramming*: that is, for writing code that generates or modifies code. Python has facilities for dynamic metaprogramming: features like decorators, metaclasses, and many more. These features make Python very flexible and productive, but since they're dynamic, they come with runtime overhead.
Other languages have static or compile-time metaprogramming features, like C preprocessor macros and C++ templates. These can be limiting and hard to use.

To support Modular's work in AI, Mojo aims to provide powerful, easy-to-use metaprogramming with zero runtime cost. This compile-time metaprogramming uses the same language as runtime programs, so you don't have to learn a new language—just a few new features.

The main new feature is *parameters*. You can think of a parameter as a compile-time variable that becomes a runtime constant. This usage of "parameter" is probably different from what you're used to from other languages, where "parameter" and "argument" are often used interchangeably. In Mojo, "parameter" and "parameter expression" refer to compile-time values, and "argument" and "expression" refer to runtime values.

In Mojo, you can add parameters to a struct or function. You can also define named parameter expressions—aliases—that you can use as runtime constants.

## Parameterized functions

To define a *parameterized function*, add parameters in square brackets ahead of the argument list. Each parameter is formatted just like an argument: a parameter name, followed by a colon and a type (which is required). In the following example, the function has a single parameter, `count` of type `Int`.

```mojo
fn repeat[count: Int](msg: String):
    @parameter
    for i in range(count):
        print(msg)
```

The [`@parameter`](/mojo/manual/decorators/parameter) decorator shown here causes the `for` loop to be evaluated at compile time. The decorator only works if the loop limits are compile-time constants. Since `count` is a parameter, `range(count)` can be calculated at compile time.

When you call a parameterized function, you provide values for the parameters, just like function arguments:

```mojo
repeat[3]("Hello")
```

```output
Hello
Hello
Hello
```

The compiler resolves the parameter values during compilation, and creates a concrete version of the `repeat[]()` function for each unique parameter value. After resolving the parameter values and unrolling the loop, the `repeat[3]()` function would be roughly equivalent to this:

```mojo
fn repeat_3(msg: String):
    print(msg)
    print(msg)
    print(msg)
```

:::note

This doesn't represent actual code generated by the compiler. By the time parameters are resolved, Mojo code has already been transformed to an intermediate representation in [MLIR](https://mlir.llvm.org/).

:::

If the compiler can't resolve all parameter values to constant values, compilation fails.

## Anatomy of a parameter list

Parameters to a function or struct appear in square brackets after a function or struct name. Parameters always require type annotations.

When you're looking at a function or struct definition, you may see some special characters such as `/` and `*` in the parameter list. Here's an example:

```mojo
def my_sort[
    # infer-only parameters
    Type: DType,
    width: Int,
    //,
    # positional-only parameter
    values: SIMD[Type, width],
    /,
    # positional-or-keyword parameter
    compare: fn (Scalar[Type], Scalar[Type]) -> Int,
    *,
    # keyword-only parameter
    reverse: Bool = False,
]() -> SIMD[Type, width]:
```

Here's a quick overview of the special characters in the parameter list:

- Double slash (`//`): parameters declared before the double slash are [infer-only parameters](#infer-only-parameters).

- Slash (`/`): parameters declared before a slash are positional-only parameters.
  Positional-only and keyword-only parameters follow the same rules as [positional-only and keyword-only arguments](/mojo/manual/functions#positional-only-and-keyword-only-arguments).

- A parameter name prefixed with a star, like `*Types` identifies a [variadic parameter](#variadic-parameters) (not shown in the example above). Any parameters following the variadic parameter are keyword-only.

- Star (`*`): in a parameter list with no variadic parameter, a star by itself indicates that the following parameters are keyword-only parameters.

- An equals sign (`=`) introduces a default value for an [optional parameter](#optional-parameters-and-keyword-parameters).

## Parameters and generics

"Generics" refers to functions that can act on multiple types of values, or containers that can hold multiple types of values. For example, [`List`](/mojo/stdlib/collections/list/List) can hold different types of values, so you can have a list of `Int` values or a list of `String` values.

In Mojo, generics use parameters to specify types. For example, `List` takes a type parameter, so a list of integers is written `List[Int]`. So all generics use parameters, but **not** everything that uses parameters is a generic. For example, the `repeat[]()` function in the previous section includes a parameter of type `Int` and an argument of type `String`. It's parameterized, but not generic. A generic function or struct is parameterized on *type*. For example, we could rewrite `repeat[]()` to take any type of argument that conforms to the [`Stringable`](/mojo/stdlib/builtin/str/Stringable) trait:

```mojo
fn repeat[MsgType: Stringable, count: Int](msg: MsgType):
    @parameter
    for i in range(count):
        print(String(msg))

# Must use keyword parameter for `count`
repeat[count=2](42)
```

```output
42
42
```

This updated function takes any `Stringable` type, so you can pass it an `Int`, `String`, or `Bool` value.

You can't pass `count` as a positional parameter without also specifying `MsgType`. You can put `//` after `MsgType` to specify that it's always inferred by the argument. Now you can pass the `count` parameter positionally:

```mojo
fn repeat[MsgType: Stringable, //, count: Int](msg: MsgType):
    @parameter
    for i in range(count):
        print(String(msg))

# MsgType is always inferred, so the first positional parameter `2` is passed to `count`
repeat[2](42)
```

```output
42
42
```

Mojo's support for generics is still early. You can write generic functions like this using traits and parameters. You can also write generic collections like `List` and `Dict`. If you're interested in learning how these types work, you can find the source code for the standard library collection types [on GitHub](https://github.com/modular/modular/blob/main/mojo/stdlib/src/collections/).

## Parameterized structs

You can also add parameters to structs. You can use parameterized structs to build generic collections.
For example, a generic array type might include code like this:

```mojo
from memory import UnsafePointer

struct GenericArray[ElementType: Copyable & Movable]:
    var data: UnsafePointer[ElementType]
    var size: Int

    fn __init__(out self, *elements: ElementType):
        self.size = len(elements)
        self.data = UnsafePointer[ElementType].alloc(self.size)
        for i in range(self.size):
            (self.data + i).init_pointee_move(elements[i])

    fn __del__(owned self):
        for i in range(self.size):
            (self.data + i).destroy_pointee()
        self.data.free()

    fn __getitem__(self, i: Int) raises -> ref [self] ElementType:
        if (i < self.size):
            return self.data[i]
        else:
            raise Error("Out of bounds")
```

A parameterized struct can use the `Self` type to represent a concrete instance of the struct (that is, with all its parameters bound). For example, you could add a static factory method to `GenericArray` with the following signature:

```mojo
struct GenericArray[ElementType: Copyable & Movable]:
    ...

    @staticmethod
    fn splat(count: Int, value: ElementType) -> Self:
        # Create a new array with count instances of the given value
```

Here, `Self` is equivalent to writing `GenericArray[ElementType]`. That is, you can call the `splat()` method like this:

```mojo
GenericArray[Float64].splat(8, 0)
```

The method returns an instance of `GenericArray[Float64]`.

### Conditional conformance

When creating a generic struct, you might want to define some methods that require extra features. For example, consider a collection like `GenericArray` that holds instances of a type that conforms to the [`Copyable`](/mojo/stdlib/builtin/value/Copyable) and [`Movable`](/mojo/stdlib/builtin/value/Movable) traits. This imposes a lot of limitations: you can't implement a `sort()` method because you can't guarantee that the stored type supports the comparison operators; you can't write a useful `__str__()` or `__repr__()` dunder method because you can't guarantee that the stored type supports conversion to a string.

The answer to these issues is *conditional conformance*, which lets you define a method that requires additional features. You do this by declaring the `self` argument with a type that places a more specific bound on one or more of the struct's parameters.

For example, the following code defines a `Container` type that holds an instance of a type conforming to `Copyable` and `Movable`. It also defines a `__str__()` method that can only be called if the stored `ElementType` conforms to `Writable`, `Copyable` and `Movable`:

```mojo
@value
struct Container[ElementType: Copyable & Movable]:
    var element: ElementType

    def __str__[StrElementType: Writable & Copyable & Movable, //](
            self: Container[StrElementType]) -> String:
        return String(self.element)

def use_container():
    float_container = Container(5)
    string_container = Container("Hello")
    print(float_container.__str__())
    print(string_container.__str__())

use_container()
```

```output
5
Hello
```

Note the signature of the `__str__()` method, which declares the `self` argument with a more specific type. Specifically, it declares that it takes a `Container` with an `ElementType` that conforms to the `Writable`, `Copyable` and `Movable` traits.

```mojo
def __str__[StrElementType: Writable & Copyable & Movable, //](
        self: Container[StrElementType]) -> String:
```

This trait must be a superset of `ElementType`'s original trait: for example, the trait composition `Writable & Copyable & Movable` ensures that it includes all of the requirements of the original trait.

Note that the `use_container()` function calls the `__str__()` method directly, rather than calling `String(float_container)`. One current limitation of conditional conformance is that Mojo can't recognize the struct `Container[Int]` as conforming to `Stringable`, even though the `__str__()` method is implemented for any `ElementType` that's also `Stringable`.
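Conditional conformance works for any extra capability, not just string conversion. Here's a minimal sketch of the same pattern (the `Pair` struct and `same_element()` method are illustrative names of my own, not library APIs), requiring `EqualityComparable` only for the method that compares elements:

```mojo
@value
struct Pair[ElementType: Copyable & Movable]:
    var first: ElementType
    var second: ElementType

    # Only callable when the stored type also supports == and !=.
    def same_element[EqElementType: EqualityComparable & Copyable & Movable, //](
            self: Pair[EqElementType]) -> Bool:
        return self.first == self.second

def main():
    pair = Pair(2, 2)
    print(pair.same_element())  # True
```

As with `__str__()` on `Container`, calling `same_element()` on a `Pair` whose element type isn't comparable fails at compile time.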
### Case study: the SIMD type

For a real-world example of a parameterized type, let's look at the [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type from Mojo's standard library.

[Single instruction, multiple data (SIMD)](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) is a parallel processing technology built into many modern CPUs, GPUs, and custom accelerators. SIMD allows you to perform a single operation on multiple pieces of data at once. For example, if you want to take the square root of each element in an array, you can use SIMD to parallelize the work.

Processors implement SIMD using low-level vector registers in hardware that hold multiple instances of a scalar data type. In order to use the SIMD instructions on these processors, the data must be shaped into the proper SIMD width (data type) and length (vector size). Processors may support 512-bit or longer SIMD vectors, and support many data types from 8-bit integers to 64-bit floating point numbers, so it's not practical to define all of the possible SIMD variations.

Mojo's [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type (defined as a struct) exposes the common SIMD operations through its methods, and makes the SIMD data type and size values parametric. This allows you to directly map your data to the SIMD vectors on any hardware.

Here's a cut-down (non-functional) version of Mojo's `SIMD` type definition:

```mojo
struct SIMD[type: DType, size: Int]:
    var value: … # Some low-level MLIR stuff here

    # Create a new SIMD from a number of scalars
    fn __init__(out self, *elems: SIMD[type, 1]): ...

    # Fill a SIMD with a duplicated scalar value.
    @staticmethod
    fn splat(x: SIMD[type, 1]) -> SIMD[type, size]: ...

    # Cast the elements of the SIMD to a different elt type.
    fn cast[target: DType](self) -> SIMD[target, size]: ...

    # Many standard operators are supported.
    fn __add__(self, rhs: Self) -> Self: ...
```

So you can create and use a SIMD vector like this:

```mojo
var vector = SIMD[DType.int16, 4](1, 2, 3, 4)
vector = vector * vector
for i in range(4):
    print(vector[i], end=" ")
```

```output
1 4 9 16
```

As you can see, a simple arithmetic operator like `*` applied to a pair of `SIMD` vectors operates on the corresponding elements in each vector.

Defining each SIMD variant with parameters is great for code reuse because the `SIMD` type can express all the different vector variants statically, instead of requiring the language to pre-define every variant.

Because `SIMD` is a parameterized type, the `self` argument in its functions carries those parameters—the full type name is `SIMD[type, size]`. Although it's valid to write this out (as shown in the return type of `splat()`), this can be verbose, so we recommend using the `Self` type (from [PEP673](https://peps.python.org/pep-0673/)) like the `__add__` example does.

## Overloading on parameters

Functions and methods can be overloaded on their parameter signatures. For information on overload resolution, see [Overloaded functions](/mojo/manual/functions#overloaded-functions).

## Using parameterized types and functions

You can use parametric types and functions by passing values to the parameters in square brackets. For example, for the `SIMD` type above, `type` specifies the data type and `size` specifies the length of the SIMD vector (it must be a power of 2):

```mojo
# Make a vector of 4 floats.
var small_vec = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)

# Make a big vector containing 1.0 in float16 format.
var big_vec = SIMD[DType.float16, 32](1.0) # Do some math and convert the elements to float32. var bigger_vec = (big_vec+big_vec).cast[DType.float32]() # You can write types out explicitly if you want of course. var bigger_vec2 : SIMD[DType.float32, 32] = bigger_vec print('small_vec type:', small_vec.element_type, 'length:', len(small_vec)) print('bigger_vec2 type:', bigger_vec2.element_type, 'length:', len(bigger_vec2)) ``` ```output small_vec type: float32 length: 4 bigger_vec2 type: float32 length: 32 ``` Note that the `cast()` method also needs a parameter to specify the type you want from the cast (the method definition above expects a `target` parametric value). Thus, just as the `SIMD` struct is a generic type definition, the `cast()` method is a generic method definition. At compile time, the compiler creates a concrete version of the `cast()` method with the target parameter bound to `DType.float32`. The code above shows the use of concrete types (that is, the parameters are all bound to known values). But the major power of parameters comes from the ability to define parametric algorithms and types (code that uses the parameter values). For example, here's how to define a parametric algorithm with `SIMD` that is type- and width-agnostic: ```mojo from math import sqrt fn rsqrt[dt: DType, width: Int](x: SIMD[dt, width]) -> SIMD[dt, width]: return 1 / sqrt(x) var v = SIMD[DType.float16, 4](42) print(rsqrt(v)) ``` ```output [0.154296875, 0.154296875, 0.154296875, 0.154296875] ``` Notice that the `x` argument is actually a `SIMD` type based on the function parameters. The runtime program can use the value of the parameters, because the parameters are resolved at compile-time before they are needed by the runtime program (but compile-time parameter expressions cannot use runtime values). ### Parameter inference The Mojo compiler can often *infer* parameter values, so you don't always have to specify them. For example, you can call the `rsqrt()` function defined above without any parameters: ```mojo var v = SIMD[DType.float16, 4](33) print(rsqrt(v)) ``` ```output [0.174072265625, 0.174072265625, 0.174072265625, 0.174072265625] ``` The compiler infers its parameters based on the parametric `v` value passed into it, as if you wrote `rsqrt[DType.float16, 4](v)` explicitly. Mojo can also infer the values of struct parameters from the arguments passed to a constructor or static method. For example, consider the following struct: ```mojo @value struct One[Type: Writable & Copyable & Movable]: var value: Type fn __init__(out self, value: Type): self.value = value def use_one(): s1 = One(123) s2 = One("Hello") ``` Note that you can create an instance of `One` without specifying the `Type` parameter—Mojo can infer it from the `value` argument. You can also infer parameters from a parameterized type passed to a constructor or static method: ```mojo struct Two[Type: Writable & Copyable & Movable]: var val1: Type var val2: Type fn __init__(out self, one: One[Type], another: One[Type]): self.val1 = one.value self.val2 = another.value print(String(self.val1), String(self.val2)) @staticmethod fn fire(thing1: One[Type], thing2: One[Type]): print("🔥", String(thing1.value), String(thing2.value)) def use_two(): s3 = Two(One("infer"), One("me")) Two.fire(One(1), One(2)) use_two() ``` ```output infer me 🔥 1 2 ``` `Two` takes a `Type` parameter, and its constructor takes values of type `One[Type]`. 
When constructing an instance of `Two`, you don't need to specify the `Type` parameter, since it can be inferred from the arguments. Similarly, the static `fire()` method takes values of type `One[Type]`, so Mojo can infer the `Type` value at compile time.

:::note

If you're familiar with C++, you may recognize this as similar to Class Template Argument Deduction (CTAD).

:::

## Optional parameters and keyword parameters

Just as you can specify [optional arguments](/mojo/manual/functions#optional-arguments) in function signatures, you can also define an optional *parameter* by giving it a default value.

You can also pass parameters by keyword, just like you can use [keyword arguments](/mojo/manual/functions#keyword-arguments). For a function or struct with multiple optional parameters, using keywords allows you to pass only the parameters you want to specify, regardless of their position in the function signature.

For example, here's a function with two parameters, each with a default value:

```mojo
fn speak[a: Int = 3, msg: StringLiteral = "woof"]():
    print(msg, a)

fn use_defaults() raises:
    speak()             # prints 'woof 3'
    speak[5]()          # prints 'woof 5'
    speak[7, "meow"]()  # prints 'meow 7'
    speak[msg="baaa"]() # prints 'baaa 3'
```

Recall that when a parametric function is called, Mojo can infer the parameter values. That is, it can use the parameter values attached to an argument value (see the `rsqrt[]()` example above). If the parametric function also has a default value defined, then the inferred parameter value takes precedence.

For example, in the following code, we update the parametric `speak[]()` function to take an argument with a parametric type. Although the function has a default parameter value for `a`, Mojo instead uses the inferred `a` parameter value from the `bar` argument (as written, the default `a` value can never be used, but this is just for demonstration purposes):

```mojo
@value
struct Bar[v: Int]:
    pass

fn speak[a: Int = 3, msg: StringLiteral = "woof"](bar: Bar[a]):
    print(msg, a)

fn use_inferred():
    speak(Bar[9]())  # prints 'woof 9'
```

As mentioned above, you can also use optional parameters and keyword parameters in a struct:

```mojo
struct KwParamStruct[greeting: String = "Hello", name: String = "🔥mojo🔥"]:
    fn __init__(out self):
        print(greeting, name)

fn use_kw_params():
    var a = KwParamStruct[]()                 # prints 'Hello 🔥mojo🔥'
    var b = KwParamStruct[name="World"]()     # prints 'Hello World'
    var c = KwParamStruct[greeting="Hola"]()  # prints 'Hola 🔥mojo🔥'
```

:::note

Mojo supports positional-only and keyword-only parameters, following the same rules as [positional-only and keyword-only arguments](/mojo/manual/functions#positional-only-and-keyword-only-arguments).

:::

## Infer-only parameters

Sometimes you need to declare functions where parameters depend on other parameters. Because the signature is processed left to right, a parameter can only *depend* on a parameter earlier in the parameter list. For example:

```mojo
fn dependent_type[dtype: DType, value: Scalar[dtype]]():
    print("Value: ", value)
    print("Value is floating-point: ", dtype.is_floating_point())

dependent_type[DType.float64, Float64(2.2)]()
```

```output
Value: 2.2000000000000002
Value is floating-point: True
```

You can't reverse the position of the `dtype` and `value` parameters, because `value` depends on `dtype`. However, because `dtype` is a required parameter, you can't leave it out of the parameter list and let Mojo infer it from `value`:

```mojo
dependent_type[Float64(2.2)]() # Error!
``` Infer-only parameters are a special class of parameters that are **always** either inferred from context or specified by keyword. Infer-only parameters are placed at the **beginning** of the parameter list, set off from other parameters by the `//` sigil: ```mojo fn example[type: Copyable & Movable, //, list: List[type]]() ``` Transforming `dtype` into an infer-only parameter solves this problem: ```mojo fn dependent_type[dtype: DType, //, value: Scalar[dtype]](): print("Value: ", value) print("Value is floating-point: ", dtype.is_floating_point()) dependent_type[Float64(2.2)]() ``` ```output Value: 2.2000000000000002 Value is floating-point: True ``` Because infer-only parameters are declared at the beginning of the parameter list, other parameters can depend on them, and the compiler will always attempt to infer the infer-only values from bound parameters or arguments. There are sometimes cases where it's useful to specify an infer-only parameter by keyword. For example, the [`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice) type is parametric on [origin](/mojo/manual/values/lifetimes): ```mojo struct StringSlice[mut: Bool, //, origin: Origin[mut]]: ... ``` Here, the `StringSlice` `mut` parameter is infer-only. The value is usually inferred when you create an instance of `StringSlice`. Binding the `mut` parameter by keyword lets you define a new type that's constrained to an immutable origin: ```mojo alias ImmutableStringSlice = StringSlice[mut=False] ``` If the compiler can't infer the value of an infer-only parameter, and it's not specified by keyword, compilation fails. ## Variadic parameters Mojo also supports variadic parameters, similar to [Variadic arguments](/mojo/manual/functions#variadic-arguments): ```mojo struct MyTensor[*dimensions: Int]: pass ``` Variadic parameters currently have some limitations that variadic arguments don't have: * Variadic parameters must be homogeneous—that is, all the values must be the same type. * The parameter type must be register-passable. * The parameter values aren't automatically projected into a `VariadicList`, so you need to construct the list explicitly: ```mojo fn sum_params[*values: Int]() -> Int: alias list = VariadicList(values) var sum = 0 for v in list: sum += v return sum ``` Variadic keyword parameters (for example, `**kwparams`) are not supported yet. ## Parameter expressions are just Mojo code A parameter expression is any code expression (such as `a+b`) that occurs where a parameter is expected. Parameter expressions support operators and function calls, just like runtime code, and all parameter types use the same type system as the runtime program (such as `Int` and `DType`). Because parameter expressions use the same grammar and types as runtime Mojo code, you can use many ["dependent type"](https://en.wikipedia.org/wiki/Dependent_type) features. For example, you might want to define a helper function to concatenate two SIMD vectors: ```mojo fn concat[ty: DType, len1: Int, len2: Int]( lhs: SIMD[ty, len1], rhs: SIMD[ty, len2]) -> SIMD[ty, len1+len2]: var result = SIMD[ty, len1 + len2]() for i in range(len1): result[i] = SIMD[ty, 1](lhs[i]) for j in range(len2): result[len1 + j] = SIMD[ty, 1](rhs[j]) return result var a = SIMD[DType.float32, 2](1, 2) var x = concat(a, a) print('result type:', x.element_type, 'length:', len(x)) ``` ```output result type: float32 length: 4 ``` Note that the resulting length is the sum of the input vector lengths, and this is expressed with a simple `+` operation. 
### Powerful compile-time programming

While simple expressions are useful, sometimes you want to write imperative compile-time logic with control flow. You can even do compile-time recursion. For instance, here is an example "tree reduction" algorithm that sums all elements of a vector recursively into a scalar:

```mojo
fn slice[ty: DType, new_size: Int, size: Int](
        x: SIMD[ty, size], offset: Int) -> SIMD[ty, new_size]:
    var result = SIMD[ty, new_size]()
    for i in range(new_size):
        result[i] = SIMD[ty, 1](x[i + offset])
    return result

fn reduce_add[ty: DType, size: Int](x: SIMD[ty, size]) -> Int:
    @parameter
    if size == 1:
        return Int(x[0])
    elif size == 2:
        return Int(x[0]) + Int(x[1])

    # Extract the top/bottom halves, add them, sum the elements.
    alias half_size = size // 2
    var lhs = slice[ty, half_size, size](x, 0)
    var rhs = slice[ty, half_size, size](x, half_size)
    return reduce_add[ty, half_size](lhs + rhs)

var x = SIMD[DType.index, 4](1, 2, 3, 4)
print(x)
print("Elements sum:", reduce_add(x))
```

```output
[1, 2, 3, 4]
Elements sum: 10
```

This makes use of the [`@parameter`](/mojo/manual/decorators/parameter) decorator to create a parametric if condition, which is an `if` statement that runs at compile-time. It requires that its condition be a valid parameter expression, and ensures that only the live branch of the `if` statement is compiled into the program. (This is similar to use of the `@parameter` decorator with a `for` loop shown earlier.)

## `alias`: named parameter expressions

It is very common to want to *name* compile-time values. Whereas `var` defines a runtime value, we need a way to define a compile-time temporary value. For this, Mojo uses an `alias` declaration.

For example, the [`DType`](/mojo/stdlib/builtin/dtype/DType) struct implements a simple enum using aliases for the enumerators like this (the actual `DType` implementation details vary a bit):

```mojo
struct DType:
    var value: UI8
    alias invalid = DType(0)
    alias bool = DType(1)
    alias int8 = DType(2)
    alias uint8 = DType(3)
    alias int16 = DType(4)
    alias uint16 = DType(5)
    ...
    alias float32 = DType(15)
```

This allows clients to use `DType.float32` as a parameter expression (which also works as a runtime value) naturally. Note that this is invoking the runtime constructor for `DType` at compile-time.

Types are another common use for aliases. Because types are compile-time expressions, it is handy to be able to do things like this:

```mojo
alias Float16 = SIMD[DType.float16, 1]
alias UInt8 = SIMD[DType.uint8, 1]

var x: Float16 = 0  # Float16 works like a "typedef"
```

Like `var` variables, aliases obey scope, and you can use local aliases within functions as you'd expect.

## Fully-bound, partially-bound, and unbound types

A parametric type with its parameters specified is said to be *fully-bound*. That is, all of its parameters are bound to values. As mentioned before, you can only instantiate a fully-bound type (sometimes called a *concrete type*).

However, parametric types can be *unbound* or *partially bound* in some contexts. For example, you can alias a partially-bound type to create a new type that requires fewer parameters:

```mojo
alias StringKeyDict = Dict[String, _]
var b: StringKeyDict[UInt8] = {"answer": 42}
```

Here, `StringKeyDict` is a type alias for a `Dict` that takes `String` keys. The underscore `_` in the parameter list indicates that the second parameter, `V` (the value type), is unbound. You specify the `V` parameter later, when you use `StringKeyDict`.
For example, given the following type:

```mojo
struct MyType[s: String, i: Int, i2: Int, b: Bool = True]:
    pass
```

It can appear in code in the following forms:

* *Fully bound*, with all of its parameters specified:

  ```mojo
  MyType["Hello", 3, 4, True]
  ```

* *Partially bound*, with *some but not all* of its parameters specified:

  ```mojo
  MyType["Hola", _, _, True]
  ```

* *Unbound*, with no parameters specified:

  ```mojo
  MyType[_, _, _, _]
  ```

You can also use the star-underscore expression `*_` to unbind an arbitrary number of positional parameters at the end of a parameter list.

```mojo
# These two types are equivalent
MyType["Hello", *_]
MyType["Hello", _, _, _]
```

The `*_` expression specifically matches any parameters that can be specified by position (positional-only or positional-or-keyword). To unbind keyword-only parameters, use the double-star-underscore expression, `**_`, which matches any parameters that can be specified by keyword (positional-or-keyword or keyword-only).

```mojo
@value
struct KeyWordStruct[pos_or_kw: Int, *, kw_only: Int = 10]:
    pass

# Unbind both pos_or_kw and kw_only parameters
fn use_kw_struct(k: KeyWordStruct[**_]):
    pass

def main():
    use_kw_struct(KeyWordStruct[10, kw_only=11]())
```

When a parameter is explicitly unbound with the `_`, `*_`, or `**_` expressions, you **must** specify a value for that parameter to use the type. Any default value from the original type declaration is ignored.

Partially-bound and unbound parametric types can be used in some contexts where the missing (unbound) parameters will be supplied later—such as in [aliases](#alias-named-parameter-expressions) and [automatically parameterized functions](#automatic-parameterization-of-functions).

### Omitted parameters

Mojo also supports an alternate format for unbound parameters, where the parameter is simply omitted from the expression:

```mojo
# Partially bound
MyType["Hi there"]
# Unbound
MyType
```

This format differs from the explicit unbinding syntax described above in that the default values for omitted parameters are bound immediately. For example, the following expressions are equivalent:

```mojo
MyType["Hi there"]
# equivalent to
MyType["Hi there", _, _, True] # Uses the default value for `b`
```

:::note

This format is currently supported for backwards compatibility. We intend to deprecate this format in the future in favor of the explicit unbinding syntax.

:::

## Automatic parameterization of functions

Mojo supports "automatic" parameterization of functions. If a function argument type is a [partially-bound or unbound type](#fully-bound-partially-bound-and-unbound-types), the unbound parameters are automatically added as input parameters on the function. This is easier to understand with an example:

```mojo
fn print_params(vec: SIMD[*_]):
    print(vec.type)
    print(vec.size)

var v = SIMD[DType.float64, 4](1.0, 2.0, 3.0, 4.0)
print_params(v)
```

```output
float64
4
```

In the above example, the `print_params` function is automatically parameterized. The `vec` argument takes an argument of type `SIMD[*_]`. This is an [unbound parameterized type](#fully-bound-partially-bound-and-unbound-types)—that is, it doesn't specify any parameter values for the type. Mojo treats the unbound parameters on `vec` as infer-only parameters on the function.
This is roughly equivalent to the following code:

```mojo
fn print_params[t: DType, s: Int, //](vec: SIMD[t, s]):
    print(vec.type)
    print(vec.size)
```

When you call `print_params()` you must pass it a concrete instance of the `SIMD` type—that is, one with all of its parameters specified, like `SIMD[DType.float64, 4]`. The Mojo compiler *infers* the parameter values from the input argument.

With a manually parameterized function, you can access the input parameters by name (for example, `t` and `s` in the previous example). For an automatically parameterized function, you can access the parameters as attributes on the argument (for example, `vec.type`).

This ability to access a type's input parameters is not specific to automatically parameterized functions; you can use it anywhere. You can access the input parameters of a parameterized type as attributes on the type itself:

```mojo
fn on_type():
    print(SIMD[DType.float32, 2].size) # prints 2
```

Or as attributes on an *instance* of the type:

```mojo
fn on_instance():
    var x = SIMD[DType.int32, 2](4, 8)
    print(x.type) # prints int32
```

You can even use this syntax in the function's signature to define a function's arguments and return type based on an argument's parameters. For example, if you want your function to take two SIMD vectors with the same type and size, you can write code like this:

```mojo
fn interleave(v1: SIMD, v2: __type_of(v1)) -> SIMD[v1.type, v1.size*2]:
    var result = SIMD[v1.type, v1.size*2]()
    for i in range(v1.size):
        result[i*2] = SIMD[v1.type, 1](v1[i])
        result[i*2+1] = SIMD[v1.type, 1](v2[i])
    return result

var a = SIMD[DType.int16, 4](1, 2, 3, 4)
var b = SIMD[DType.int16, 4](0, 0, 0, 0)
var c = interleave(a, b)
print(c)
```

```output
[1, 0, 2, 0, 3, 0, 4, 0]
```

As shown in the example, you can use the magic `__type_of(x)` call if you just want to match the type of an argument. In this case, it's more convenient and compact than writing the equivalent `SIMD[v1.type, v1.size]`.

### Automatic parameterization of parameters

You can also take advantage of automatic parameterization in a function's parameter list. For example:

```mojo
fn foo[value: SIMD]():
    pass

# Equivalent to:
fn foo[type: DType, size: Int, //, value: SIMD[type, size]]():
    pass
```

### Automatic parameterization with partially-bound types

Mojo also supports automatic parameterization with [partially-bound parameterized types](#fully-bound-partially-bound-and-unbound-types) (that is, types with some but not all of the parameters specified).

For example, suppose we have a `Fudge` struct with three parameters:

```mojo
@value
struct Fudge[sugar: Int, cream: Int, chocolate: Int = 7](Stringable):
    fn __str__(self) -> String:
        return String.write("Fudge (", sugar, ",", cream, ",", chocolate, ")")
```

We can write a function that takes a `Fudge` argument with just one bound parameter (it's *partially bound*):

```mojo
fn eat(f: Fudge[5, *_]):
    print("Ate " + String(f))
```

The `eat()` function takes a `Fudge` struct with the first parameter (`sugar`) bound to the value 5. The second and third parameters, `cream` and `chocolate`, are unbound.

The unbound `cream` and `chocolate` parameters become implicit input parameters on the `eat` function.
In practice, this is roughly equivalent to writing:

```mojo
fn eat[cr: Int, ch: Int](f: Fudge[5, cr, ch]):
    print("Ate", String(f))
```

In both cases, we can call the function by passing in an instance with the `cream` and `chocolate` parameters bound:

```mojo
eat(Fudge[5, 5, 7]())
eat(Fudge[5, 8, 9]())
```

```output
Ate Fudge (5,5,7)
Ate Fudge (5,8,9)
```

If you try to pass in an argument with a `sugar` value other than 5, compilation fails, because it doesn't match the argument type:

```mojo
eat(Fudge[12, 5, 7]())
# ERROR: invalid call to 'eat': argument #0 cannot be converted from 'Fudge[12, 5, 7]' to 'Fudge[5, 5, 7]'
```

You can also explicitly unbind individual parameters. This gives you more freedom in specifying unbound parameters.

For example, you might want to let the user specify values for `sugar` and `chocolate`, and leave `cream` constant. To do this, replace each unbound parameter value with a single underscore (`_`):

```mojo
fn devour(f: Fudge[_, 6, _]):
    print("Devoured", String(f))
```

Again, the unbound parameters (`sugar` and `chocolate`) are added as implicit input parameters on the function. This version is roughly equivalent to the following code, where these two values are explicitly bound to the input parameters, `su` and `ch`:

```mojo
fn devour[su: Int, ch: Int](f: Fudge[su, 6, ch]):
    print("Devoured", String(f))
```

You can also specify parameters by keyword, or mix positional and keyword parameters, so the following function is roughly equivalent to the previous one: the first parameter, `sugar`, is explicitly unbound with the underscore character. The `chocolate` parameter is unbound using the keyword syntax, `chocolate=_`. And `cream` is explicitly bound to the value 6:

```mojo
fn devour(f: Fudge[_, chocolate=_, cream=6]):
    print("Devoured", String(f))
```

All three versions of the `devour()` function work with the following calls:

```mojo
devour(Fudge[3, 6, 9]())
devour(Fudge[4, 6, 8]())
```

```output
Devoured Fudge (3,6,9)
Devoured Fudge (4,6,8)
```

### Legacy syntax (omitted parameters)

You can also specify an unbound or partially-bound type by omitting parameters. For example:

```mojo
fn nibble(f: Fudge[5]):
    print("Ate", String(f))

nibble(Fudge[5, 4, 7]())
```

```output
Ate Fudge (5,4,7)
```

Here, `Fudge[5]` works like `Fudge[5, *_]` **except** in the handling of parameters with default values. Instead of discarding the default value of `chocolate`, `Fudge[5]` binds the default value immediately, making it equivalent to: `Fudge[5, _, 7]`.

This means that the following code won't compile with the previous definition for the `nibble()` function, since it doesn't use the default value for `chocolate`:

```mojo
nibble(Fudge[5, 5, 9]())
# ERROR: invalid call to 'nibble': argument #0 cannot be converted from 'Fudge[5, 5, 9]' to 'Fudge[5, 5, 7]'
```

:::note TODO

Support for omitting unbound parameters will eventually be deprecated in favor of explicitly unbound parameters using `_` and `*_`.

:::

## The `rebind()` builtin

One of the consequences of Mojo not performing function instantiation in the parser like C++ is that Mojo cannot always figure out whether some parametric types are equal, and will complain about an invalid conversion. This typically occurs in static dispatch patterns.
For example, the following code won't compile:

```mojo
fn take_simd8(x: SIMD[DType.float32, 8]):
    pass

fn generic_simd[nelts: Int](x: SIMD[DType.float32, nelts]):
    @parameter
    if nelts == 8:
        take_simd8(x)
```

The parser will complain:

```plaintext
error: invalid call to 'take_simd8': argument #0 cannot be converted from
'SIMD[f32, nelts]' to 'SIMD[f32, 8]'
        take_simd8(x)
        ~~~~~~~~~~^~~
```

This is because the parser fully type-checks the function without instantiation, so the type of `x` is still `SIMD[f32, nelts]`, and not `SIMD[f32, 8]`, despite the static conditional. The remedy is to manually "rebind" the type of `x`, using the `rebind` builtin, which inserts a compile-time assert that the input and result types resolve to the same type after function instantiation:

```mojo
fn take_simd8(x: SIMD[DType.float32, 8]):
    pass

fn generic_simd[nelts: Int](x: SIMD[DType.float32, nelts]):
    @parameter
    if nelts == 8:
        take_simd8(rebind[SIMD[DType.float32, 8]](x))
```

---

## partial_simd_load

`partial_simd_load[type: DType, //, width: Int](storage: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], lbound: Int, rbound: Int, pad_value: SIMD[type, 1]) -> SIMD[type, width]`

Loads a vector with dynamic bound. Out of bound data will be filled with the pad value. Data is valid if `lbound <= idx < rbound`.

**Parameters:**

* type (`DType`): The DType of storage.
* width (`Int`): The system simd vector size.

**Args:**

* storage (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the address to perform load.
* lbound (`Int`): Lower bound of valid index within simd (inclusive).
* rbound (`Int`): Upper bound of valid index within simd (non-inclusive).
* pad\_value (`SIMD[type, 1]`): Value to fill for out of bound indices.

**Returns:**

The SIMD vector loaded, with out-of-bound lanes set to the pad value.

---

## partial_simd_store

`partial_simd_store[type: DType, //, width: Int](storage: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], lbound: Int, rbound: Int, data: SIMD[type, width])`

Stores a vector with dynamic bound. Out of bound data will be ignored. Data is valid if `lbound <= idx < rbound`.

**Parameters:**

* type (`DType`): The DType of storage.
* width (`Int`): The system simd vector size.

**Args:**

* storage (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the address to perform store.
* lbound (`Int`): Lower bound of valid index within simd (inclusive).
* rbound (`Int`): Upper bound of valid index within simd (non-inclusive).
* data (`SIMD[type, width]`): The vector value to store.

---

## partition

`partition[: origin.set, T: Copyable & Movable, origin: MutableOrigin, //, cmp_fn: fn(T, T) capturing -> Bool](span: Span[T, origin], k: Int)`

Partition the input buffer in place such that the first k elements are the largest (or smallest if `cmp_fn` is a less-than comparison).

**Parameters:**

* T (`Copyable & Movable`): Type of the underlying data.
* origin (`MutableOrigin`): Origin of span.
* cmp\_fn (`fn(T, T) capturing -> Bool`): Comparison functor of (T, T) capturing \[\_] -> Bool type.

**Args:**

* span (`Span[T, origin]`): Input buffer.
* k (`Int`): Index of the partition element.

---

## partition_work

`partition_work(task_id: Int, num_tasks: Int, work: Int, work_block_size: Int) -> IndexList[2]`

---

## Passwd

`struct Passwd`

Represents user account information retrieved from the user password database related to a user ID.

## Fields

* pw\_name (`String`): User name.
* pw\_passwd (`String`): User password.
* pw\_uid (`Int`): User ID.
* pw\_gid (`Int`): Group ID.
* pw\_gecos (`String`): Real name or comment field.
* pw\_dir (`String`): Home directory.
* pw\_shell (`String`): Shell program.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this Passwd struct to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the `Writer` trait.

**Args:**

* writer (`W`): The object to write to.

### `__str__`

`__str__(self) -> String`

Gets the Passwd struct as a string.

**Returns:**

A compact string of the Passwd struct.

### `__repr__`

`__repr__(self) -> String`

Gets the Passwd struct as a string.

**Returns:**

A compact string representation of the Passwd struct.

---

## path

Provides a set of operating-system independent functions for manipulating file system paths.

## Modules

* [`path`](/mojo/stdlib/os/path/path/): Provides a set of operating-system independent functions for manipulating file system paths.

---

## path

Provides a set of operating-system independent functions for manipulating file system paths. You can import these APIs from the `os.path` package. For example:

```mojo
from os.path import isdir
```

## Functions

* [`basename`](/mojo/stdlib/os/path/path/basename): Returns the tail section of a path.
* [`dirname`](/mojo/stdlib/os/path/path/dirname): Returns the directory component of a pathname.
* [`exists`](/mojo/stdlib/os/path/path/exists): Return True if path exists.
* [`expanduser`](/mojo/stdlib/os/path/path/expanduser): Expands a tilde "\~" prefix in `path` to the user's home directory.
* [`expandvars`](/mojo/stdlib/os/path/path/expandvars): Replaces `${var}` or `$var` in the path with values from the current environment variables. Malformed variable names and references to non-existing variables are left unchanged.
* [`getsize`](/mojo/stdlib/os/path/path/getsize): Return the size, in bytes, of the specified path.
* [`is_absolute`](/mojo/stdlib/os/path/path/is_absolute): Return True if `path` is an absolute path name. On Unix, that means it begins with a slash.
* [`isdir`](/mojo/stdlib/os/path/path/isdir): Return True if path is an existing directory. This follows symbolic links, so both islink() and isdir() can be true for the same path.
* [`isfile`](/mojo/stdlib/os/path/path/isfile): Test whether a path is a regular file.
* [`islink`](/mojo/stdlib/os/path/path/islink): Return True if path refers to an existing directory entry that is a symbolic link.
* [`join`](/mojo/stdlib/os/path/path/join): Join two or more pathname components, inserting '/' as needed. If any component is an absolute path, all previous path components will be discarded. An empty last part will result in a path that ends with a separator.
* [`lexists`](/mojo/stdlib/os/path/path/lexists): Return True if path exists or is a broken symlink.
* [`split`](/mojo/stdlib/os/path/path/split): Split a given pathname into two components: head and tail. This is useful for separating the directory path from the filename. If the input path ends with a separator, the tail component will be empty. If there is no separator in the path, the head component will be empty, and the entire path will be considered the tail. Trailing separators in the head are stripped unless the head is the root directory.
* [`split_extension`](/mojo/stdlib/os/path/path/split_extension): Splits `path` into the root and extension.
* [`splitroot`](/mojo/stdlib/os/path/path/splitroot): Splits `path` into drive, root and tail. The tail contains anything after the root.

---

## path

Implements `Path` and related functions.

## Aliases

### `DIR_SEPARATOR`

`alias DIR_SEPARATOR = "\\" if os_is_windows() else "/"`

## Structs

* [`Path`](/mojo/stdlib/pathlib/path/Path): The Path object.

## Functions

* [`cwd`](/mojo/stdlib/pathlib/path/cwd): Gets the current directory.

---

## Path

`struct Path`

The Path object.

## Fields

* path (`String`): The underlying path string representation.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Hashable`, `KeyElement`, `Movable`, `PathLike`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher`

## Methods

### `__init__`

`__init__(out self)`

Initializes a path with the current directory.

`__init__(out self, path: StringSlice[origin])`

Initializes a path with the provided path.

**Args:**

* path (`StringSlice[origin]`): The file system path.

`@implicit`

`__init__(out self, owned path: String)`

Initializes a path with the provided path.

**Args:**

* path (`String`): The file system path.

`@implicit`

`__init__(out self, path: StringLiteral[value])`

Initializes a path with the provided path.

**Args:**

* path (`StringLiteral[value]`): The file system path.

### `__bool__`

`__bool__(self) -> Bool`

Checks if the path is not empty.

**Returns:**

True if the path length is greater than zero, and False otherwise.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Returns True if the two paths are equal.

**Args:**

* other (`Self`): The other path to compare against.

**Returns:**

True if the paths are equal and False otherwise.

`__eq__(self, other: StringSlice[origin]) -> Bool`

Returns True if the two paths are equal.

**Args:**

* other (`StringSlice[origin]`): The other path to compare against.

**Returns:**

True if the String and Path are equal, and False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Returns True if the two paths are not equal.

**Args:**

* other (`Self`): The other path to compare against.

**Returns:**

True if the paths are not equal and False otherwise.

### `__truediv__`

`__truediv__(self, suffix: Self) -> Self`

Joins two paths using the system-defined path separator.

**Args:**

* suffix (`Self`): The suffix to append to the path.

**Returns:**

A new path with the suffix appended to the current path.

`__truediv__(self, suffix: StringSlice[origin]) -> Self`

Joins two paths using the system-defined path separator.

**Args:**

* suffix (`StringSlice[origin]`): The suffix to append to the path.

**Returns:**

A new path with the suffix appended to the current path.

### `__itruediv__`

`__itruediv__(mut self, suffix: StringSlice[origin])`

Joins two paths using the system-defined path separator.

**Args:**

* suffix (`StringSlice[origin]`): The suffix to append to the path.

### `copy`

`copy(self) -> Self`

Copy the object.

**Returns:**

A copy of the value.

### `__str__`

`__str__(self) -> String`

Returns a string representation of the path.

**Returns:**

A string representation of the path.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this path to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the `Writer` trait.

**Args:**

* writer (`W`): The object to write to.
### `__fspath__` `__fspath__(self) -> String` Returns a string representation of the path. **Returns:** A string representation of the path. ### `__repr__` `__repr__(self) -> String` Returns a printable representation of the path. **Returns:** A printable representation of the path. ### `__hash__` `__hash__(self) -> UInt` Hash the underlying path string using builtin hash. **Returns:** An integer value containing the hash of the path string. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the path string value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `stat` `stat(self) -> stat_result` Returns the stat information on the path. **Returns:** A stat\_result object containing information about the path. ### `lstat` `lstat(self) -> stat_result` Returns the lstat information on the path. This is similar to stat, but if the file is a symlink then it gives you information about the symlink rather than the target. **Returns:** A stat\_result object containing information about the path. ### `exists` `exists(self) -> Bool` Returns True if the path exists and False otherwise. **Returns:** True if the path exists on disk and False otherwise. ### `expanduser` `expanduser(self) -> Self` Expands a prefixed `~` with `$HOME` on posix or `$USERPROFILE` on windows. If environment variables are not set or the `path` is not prefixed with `~`, returns the `path` unmodified. **Returns:** The expanded path. ### `home` `static home() -> Self` Returns `$HOME` on posix or `$USERPROFILE` on windows. If environment variables are not set it returns `~`. **Returns:** Path to user home directory. ### `is_dir` `is_dir(self) -> Bool` Returns True if the path is a directory and False otherwise. **Returns:** Return True if the path points to a directory (or a link pointing to a directory). ### `is_file` `is_file(self) -> Bool` Returns True if the path is a file and False otherwise. **Returns:** Return True if the path points to a file (or a link pointing to a file). ### `read_text` `read_text(self) -> String` Returns content of the file. **Returns:** Contents of file as string. ### `read_bytes` `read_bytes(self) -> List[SIMD[uint8, 1]]` Returns content of the file as bytes. **Returns:** Contents of file as list of bytes. ### `write_text` `write_text[T: Writable](self, value: T)` Writes the value to the file as text. **Parameters:** * ​T (`Writable`): The type of an object conforming to the `Writable` trait. **Args:** * ​value (`T`): The value to write. ### `write_bytes` `write_bytes(self, bytes: Span[SIMD[uint8, 1], origin])` Writes bytes to the file. **Args:** * ​bytes (`Span[SIMD[uint8, 1], origin]`): The bytes to write to this file. ### `suffix` `suffix(self) -> String` The path's extension, if any. This includes the leading period. For example: '.txt'. If no extension is found, returns the empty string. **Returns:** The path's extension. ### `joinpath` `joinpath(self, *pathsegments: String) -> Self` Joins the Path using the pathsegments. **Args:** * ​\*pathsegments (`String`): The path segments. **Returns:** The path concatenation with the pathsegments using the directory separator. ### `listdir` `listdir(self) -> List[Path]` Gets the list of entries contained in the path provided. **Returns:** The list of entries in the path provided. --- ## pathlib Implements the pathlib package. ## Modules * [​`path`](/mojo/stdlib/pathlib/path/): Implements `Path` and related functions. --- ## pathlike Implements the `PathLike` trait. 
You can import the trait from the `os` package. For example: ```mojo from os import PathLike ``` ## Traits * [​`PathLike`](/mojo/stdlib/os/pathlike/PathLike): A trait representing file system paths. --- ## PathLike A trait representing file system paths. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__fspath__` `__fspath__(self: _Self) -> String` Return the file system path representation of the object. **Returns:** The file system path representation as a string. --- ## PDL `struct PDL` Programmatic Dependency Launch (PDL) control structure. This struct provides a way to manage programmatic stream serialization on NVIDIA GPUs. It includes functions for launching dependent grids and waiting for them to complete. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize the PDL control structure. ### `__enter__` `__enter__(self)` Launch dependent grids that were previously configured to depend on the current grid. ### `__exit__` `__exit__(self)` Wait for all dependent grids launched by this grid to complete execution. --- ## PDLLevel `@register_passable(trivial)` `struct PDLLevel` Programmatic Dependency Launch (PDL) level. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `NO_WAIT_OVERLAP_AT_END` `alias NO_WAIT_OVERLAP_AT_END = PDLLevel(3)` ### `OFF` `alias OFF = PDLLevel(0)` ### `OVERLAP_AT_BEGINNING` `alias OVERLAP_AT_BEGINNING = PDLLevel(2)` ### `OVERLAP_AT_END` `alias OVERLAP_AT_END = PDLLevel(1)` ## Methods ### `__init__` `__init__() -> Self` Initialize the PDL level to OFF. `__init__(level: Int) -> Self` Initialize the PDL level. **Args:** * ​level (`Int`): The PDL level to initialize. ### `__eq__` `__eq__(self, other: Self) -> Bool` Check if the PDL level is equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is equal to the other PDL level, False otherwise. `__eq__(self, other: Int) -> Bool` Check if the PDL level is equal to another PDL level. **Args:** * ​other (`Int`): The other PDL level to compare against. **Returns:** True if the PDL level is equal to the other PDL level, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Check if the PDL level is not equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is not equal to the other PDL level, False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Check if the PDL level is greater than another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is greater than the other PDL level, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Check if the PDL level is greater than or equal to another PDL level. **Args:** * ​other (`Self`): The other PDL level to compare against. **Returns:** True if the PDL level is greater or equal to the other PDL level, False otherwise. --- ## per_channel_grouped_4bit ## Structs * [​`block_Q4_K`](./block_Q4_K): * [​`block_Q6_K`](./block_Q6_K): * [​`block_QK_K`](./block_QK_K): * [​`Q4sym`](./Q4sym): Q4sym: compresses values of type `float_dtype` to 4bit unsigned integers which have been dynamically symmetrically quantized with the given scale factor. 
## Functions

* [`calculate_symmetric_vector`](./calculate_symmetric_vector): Symmetrically quantizes the given SIMD vector `data` with input type `input_dtype` and `simd_width` elements, assuming we want the results to fit in an unsigned integer of size `output_bits`.
* [`q4_k_dequantize_impl`](./q4_k_dequantize_impl):
* [`q6_k_dequantize_impl`](./q6_k_dequantize_impl):
* [`scale_min_k4`](./scale_min_k4):

---

## perf_counter

`perf_counter() -> SIMD[float64, 1]`

Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid.

**Returns:**

The current time in fractional seconds.

---

## perf_counter_ns

`perf_counter_ns() -> UInt`

Return the value (in nanoseconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid.

**Returns:**

The current time in ns.

---

## pipeline

Hugging Face Token Generation Pipeline.

## `KVCacheMixin` {#max.pipelines.lib.pipeline.KVCacheMixin}

> *class* max.pipelines.lib.pipeline.KVCacheMixin(\*args, \*\*kwargs)

### `estimate_kv_cache_size()` {#max.pipelines.lib.pipeline.KVCacheMixin.estimate_kv_cache_size}

> *abstract classmethod* estimate\_kv\_cache\_size(pipeline\_config, available\_cache\_memory, devices, huggingface\_config, kv\_cache\_config, cache\_dtype)

Estimates the size of the kv cache in bytes.

**Parameters:**

* **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) )
* **available\_cache\_memory** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`Device`](../driver.md#max.driver.Device) `]` )
* **huggingface\_config** (`AutoConfig` )
* **kv\_cache\_config** (`KVCacheConfig` )
* **cache\_dtype** ([`DType`](../dtype.md#max.dtype.DType) )

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `get_kv_params()` {#max.pipelines.lib.pipeline.KVCacheMixin.get_kv_params}

> *abstract classmethod* get\_kv\_params(huggingface\_config, n\_devices, kv\_cache\_config, cache\_dtype)

Returns the KV cache params for the pipeline model.

**Parameters:**

* **huggingface\_config** (`AutoConfig` )
* **n\_devices** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **kv\_cache\_config** (`KVCacheConfig` )
* **cache\_dtype** ([`DType`](../dtype.md#max.dtype.DType) )

**Return type:**

[*KVCacheParams*](../nn/kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams)

### `get_num_layers()` {#max.pipelines.lib.pipeline.KVCacheMixin.get_num_layers}

> *abstract classmethod* get\_num\_layers(huggingface\_config)

Returns the number of layers for the pipeline model.

**Parameters:**

**huggingface\_config** (`AutoConfig` )

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `load_kv_manager()` {#max.pipelines.lib.pipeline.KVCacheMixin.load_kv_manager}

> load\_kv\_manager(session, available\_cache\_memory)

Provided a PipelineConfig and InferenceSession, loads the KV manager.
**Parameters:**

* **session** ([`InferenceSession`](../engine.md#max.engine.InferenceSession) ) – Inference session to compile and init the KV cache.
* **available\_cache\_memory** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – Amount of memory available to the KV cache, in bytes.

**Returns:**

Either a single KV cache manager or a tuple of KV cache managers, one per input modality.

## `ModelInputs` {#max.pipelines.lib.pipeline.ModelInputs}

> *class* max.pipelines.lib.pipeline.ModelInputs

Base class for model inputs. Use this class to encapsulate inputs for your model. You may store any number of dataclass fields.

The following example demonstrates how to create a custom inputs class for a model:

```python
class ReplitInputs(ModelInputs):
    tokens: Tensor
    input_row_offsets: Tensor

    def __init__(self, tokens: Tensor, input_row_offsets: Tensor):
        self.tokens = tokens
        self.input_row_offsets = input_row_offsets

tokens = Tensor.zeros((1, 2, 3), DType.int64)
input_row_offsets = Tensor.zeros((1, 1, 1), DType.int64)

# Initialize inputs
inputs = ReplitInputs(tokens=tokens, input_row_offsets=input_row_offsets)

# Access tensors
list(inputs) == [tokens, input_row_offsets]  # Output: True
```

### `kv_cache_inputs` {#max.pipelines.lib.pipeline.ModelInputs.kv_cache_inputs}

> kv\_cache\_inputs\*: [KVCacheInputs](../nn/kv_cache/manager.md#max.nn.kv_cache.manager.KVCacheInputs) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

## `ModelOutputs` {#max.pipelines.lib.pipeline.ModelOutputs}

> *class* max.pipelines.lib.pipeline.ModelOutputs(logits: 'Tensor', next\_token\_logits: 'Tensor | None' = None, logit\_offsets: 'Tensor | None' = None)

**Parameters:**

* **logits** ([`Tensor`](../driver.md#max.driver.Tensor) )
* **next\_token\_logits** ([`Tensor`](../driver.md#max.driver.Tensor) `|` `None` )
* **logit\_offsets** ([`Tensor`](../driver.md#max.driver.Tensor) `|` `None` )

### `logit_offsets` {#max.pipelines.lib.pipeline.ModelOutputs.logit_offsets}

> logit\_offsets\*: [Tensor](../driver.md#max.driver.Tensor) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

Offsets to access variable length logits for each sequence.

### `logits` {#max.pipelines.lib.pipeline.ModelOutputs.logits}

> logits\*: [Tensor](../driver.md#max.driver.Tensor)\*

Logits for a variable number of tokens per sequence.

### `next_token_logits` {#max.pipelines.lib.pipeline.ModelOutputs.next_token_logits}

> next\_token\_logits\*: [Tensor](../driver.md#max.driver.Tensor) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None*

Logits for just the next token.

## `PipelineModel` {#max.pipelines.lib.pipeline.PipelineModel}

> *class* max.pipelines.lib.pipeline.PipelineModel(pipeline\_config, session, huggingface\_config, encoding, devices, kv\_cache\_config, weights, adapter, return\_logits)

A pipeline model with setup, input preparation and execution methods.
**Parameters:** * **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) ) * **session** ([`InferenceSession`](../engine.md#max.engine.InferenceSession) ) * **huggingface\_config** (`AutoConfig` ) * **encoding** (`SupportedEncoding` ) * **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`Device`](../driver.md#max.driver.Device) `]` ) * **kv\_cache\_config** (`KVCacheConfig` ) * **weights** (`Weights` ) * **adapter** (`Optional` `[` `WeightsAdapter` `]` ) * **return\_logits** ([`ReturnLogits`](../nn/transformer/transformer.md#max.nn.transformer.transformer.ReturnLogits) ) ### `calculate_max_seq_len()` {#max.pipelines.lib.pipeline.PipelineModel.calculate_max_seq_len} > *abstract classmethod* calculate\_max\_seq\_len(pipeline\_config, huggingface\_config) Calculate the optimal max sequence length for the model. Models are expected to implement this method. The following example shows how to implement this method for a Mistral model: ```python class MistralModel(PipelineModel): @classmethod def calculate_max_seq_len(cls, pipeline_config, huggingface_config) -> int: try: return upper_bounded_default( upper_bound=huggingface_config.max_seq_len, default=pipeline_config.max_length, ) except ValueError as e: msg = ( "Unable to infer max_length for Mistral, the provided " f"max_length ({pipeline_config.max_length}) exceeds the " f"model's max_seq_len ({huggingface_config.max_seq_len})." ) raise ValueError(msg) from e ``` **Parameters:** * **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) ) – Configuration for the pipeline. * **huggingface\_config** (`AutoConfig` ) – Hugging Face model configuration. **Returns:** The maximum sequence length to use. **Return type:** [int](https://docs.python.org/3/library/functions.html#int) ### `compute_log_probabilities()` {#max.pipelines.lib.pipeline.PipelineModel.compute_log_probabilities} > compute\_log\_probabilities(model\_inputs, model\_outputs, next\_tokens, batch\_top\_n, batch\_echo) Optional method that can be overridden to compute log probabilities. **Parameters:** * **model\_inputs** ([`ModelInputs`](#max.pipelines.lib.pipeline.ModelInputs) ) – Inputs to the model returned by prepare\_\*\_token\_inputs(). * **model\_outputs** ([`ModelOutputs`](#max.pipelines.lib.pipeline.ModelOutputs) ) – Outputs returned by execute(). * **next\_tokens** ([`Tensor`](../driver.md#max.driver.Tensor) ) – Sampled tokens. Should have shape=\[batch size] * **batch\_top\_n** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – Number of top log probabilities to return per input in the batch. For any element where top\_n == 0, the LogProbabilities is skipped. * **batch\_echo** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`bool`](https://docs.python.org/3/library/functions.html#bool) `]` ) – Whether to include input tokens in the returned log probabilities. **Returns:** List of log probabilities. 
**Return type:**

[list](https://docs.python.org/3/library/stdtypes.html#list)\[[*LogProbabilities*](core.md#max.pipelines.core.LogProbabilities) | None] | None

### `dtype` {#max.pipelines.lib.pipeline.PipelineModel.dtype}

> *property* dtype\*: [DType](../dtype.md#max.dtype.DType)\*

### `estimate_weights_size()` {#max.pipelines.lib.pipeline.PipelineModel.estimate_weights_size}

> *classmethod* estimate\_weights\_size(pipeline\_config)

Calculates the estimated memory consumption of our model.

**Parameters:**

**pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) )

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `execute()` {#max.pipelines.lib.pipeline.PipelineModel.execute}

> *abstract* execute(model\_inputs)

Executes the graph with the given inputs.

**Parameters:**

**model\_inputs** ([`ModelInputs`](#max.pipelines.lib.pipeline.ModelInputs) ) – The model inputs to execute, containing tensors and any other required data for model execution.

**Returns:**

ModelOutputs containing the pipeline’s output tensors.

**Return type:**

[*ModelOutputs*](#max.pipelines.lib.pipeline.ModelOutputs)

This is an abstract method that must be implemented by concrete PipelineModels to define their specific execution logic.

### `infer_optimal_batch_size()` {#max.pipelines.lib.pipeline.PipelineModel.infer_optimal_batch_size}

> *classmethod* infer\_optimal\_batch\_size(pipeline\_config, available\_cache\_memory, huggingface\_config, devices, kv\_cache\_config, cache\_dtype)

Returns the estimated optimal batch size to run the model given current memory constraints.

**Parameters:**

* **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) )
* **available\_cache\_memory** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **huggingface\_config** (`AutoConfig` )
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`Device`](../driver.md#max.driver.Device) `]` )
* **kv\_cache\_config** (`KVCacheConfig` )
* **cache\_dtype** ([`DType`](../dtype.md#max.dtype.DType) )

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `prepare_initial_token_inputs()` {#max.pipelines.lib.pipeline.PipelineModel.prepare_initial_token_inputs}

> *abstract* prepare\_initial\_token\_inputs(context\_batch, kv\_cache\_inputs=None, return\_n\_logits=1)

Prepares the initial inputs to be passed to .execute().

The inputs and functionality of this method can vary per model. For example, the model inputs could include:

* Encoded tensors
* A unique ID for each tensor if this model uses a KV Cache manager.
* kv\_cache\_inputs: The kv cache inputs required for the model. This should be None if the model does not use KV Cache.

This function would batch the encoded tensors, claim a slot in the kv cache if the ID hasn’t been seen before, and return the inputs and caches as a list of tensors.
**Parameters:**

* **context\_batch** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` `T` `]` )
* **kv\_cache\_inputs** ([`KVCacheInputs`](../nn/kv_cache/manager.md#max.nn.kv_cache.manager.KVCacheInputs) `|` `None` )
* **return\_n\_logits** ([`int`](https://docs.python.org/3/library/functions.html#int) )

**Return type:**

[*ModelInputs*](#max.pipelines.lib.pipeline.ModelInputs)

### `prepare_next_token_inputs()` {#max.pipelines.lib.pipeline.PipelineModel.prepare_next_token_inputs}

> *abstract* prepare\_next\_token\_inputs(next\_tokens, prev\_model\_inputs)

Prepares the secondary inputs to be passed to .execute(). While prepare\_initial\_token\_inputs is responsible for managing the initial inputs, this function is responsible for updating the inputs for each step in a multi-step execution pattern.

**Parameters:**

* **next\_tokens** ([`Tensor`](../driver.md#max.driver.Tensor) )
* **prev\_model\_inputs** ([`ModelInputs`](#max.pipelines.lib.pipeline.ModelInputs) )

**Return type:**

[*ModelInputs*](#max.pipelines.lib.pipeline.ModelInputs)

## `TextGenerationPipeline` {#max.pipelines.lib.pipeline.TextGenerationPipeline}

> *class* max.pipelines.lib.pipeline.TextGenerationPipeline(pipeline\_config, pipeline\_model, eos\_token\_id, weight\_adapters)

Generalized token generator pipeline.

**Parameters:**

* **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) )
* **pipeline\_model** ([`type`](https://docs.python.org/3/library/functions.html#type) `[` [`PipelineModel`](#max.pipelines.lib.pipeline.PipelineModel) `]` )
* **eos\_token\_id** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **weight\_adapters** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` `WeightsFormat` `,` `WeightsAdapter` `]` )

### `calculate_num_steps()` {#max.pipelines.lib.pipeline.TextGenerationPipeline.calculate_num_steps}

> calculate\_num\_steps(num\_steps, context)

**Parameters:**

* **num\_steps** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **context** (`T` )

**Return type:**

[int](https://docs.python.org/3/library/functions.html#int)

### `next_token()` {#max.pipelines.lib.pipeline.TextGenerationPipeline.next_token}

> next\_token(batch, num\_steps)

Provided a batch, process batch inputs, execute the graph for num\_steps in a multi-step scenario, then decode the tokens holistically and return the list of decoded tokens.
**Parameters:** * **batch** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` `T` `]` ) * **num\_steps** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [*TextGenerationResponse*](core.md#max.pipelines.core.TextGenerationResponse)] ### `prepare_batch()` {#max.pipelines.lib.pipeline.TextGenerationPipeline.prepare_batch} > prepare\_batch(batch, num\_steps) **Parameters:** * **batch** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `T` `]` ) * **num\_steps** ([`int`](https://docs.python.org/3/library/functions.html#int) ) **Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[*ModelInputs*](#max.pipelines.lib.pipeline.ModelInputs), [int](https://docs.python.org/3/library/functions.html#int), *Tensor* | None] ### `release()` {#max.pipelines.lib.pipeline.TextGenerationPipeline.release} > release(context) Mark the context as complete, releasing the cache slot from the KV manager. **Parameters:** **context** (`T` ) **Return type:** None ### `sample_logits()` {#max.pipelines.lib.pipeline.TextGenerationPipeline.sample_logits} > sample\_logits(logits, prev\_tokens, logit\_offsets, bitmask, \*, token\_frequency\_data=None, token\_frequency\_row\_offsets=None) **Parameters:** * **logits** ([`Tensor`](../driver.md#max.driver.Tensor) ) * **prev\_tokens** ([`Tensor`](../driver.md#max.driver.Tensor) ) * **logit\_offsets** ([`Tensor`](../driver.md#max.driver.Tensor) `|` `None` ) * **bitmask** ([`Tensor`](../driver.md#max.driver.Tensor) `|` `None` ) * **token\_frequency\_data** ([`Tensor`](../driver.md#max.driver.Tensor) `|` `None` ) * **token\_frequency\_row\_offsets** ([`Tensor`](../driver.md#max.driver.Tensor) `|` `None` ) **Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[*Tensor*](../driver.md#max.driver.Tensor), [*Tensor*](../driver.md#max.driver.Tensor)] ## `get_paged_manager()` {#max.pipelines.lib.pipeline.get_paged_manager} > max.pipelines.lib.pipeline.get\_paged\_manager(pipeline) **Parameters:** **pipeline** ([`TokenGenerator`](core.md#max.pipelines.core.TokenGenerator) ) **Return type:** *PagedKVCacheManager* | None ## `upper_bounded_default()` {#max.pipelines.lib.pipeline.upper_bounded_default} > max.pipelines.lib.pipeline.upper\_bounded\_default(upper\_bound, default) Given an upper bound and an optional default value, returns a final value that cannot exceed the upper bound. **Parameters:** * **default** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) – The default value to use, or None to use the upper bound. * **upper\_bound** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The upper bound to use. **Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the provided default value exceeds the upper bound. **Returns:** The final value. **Return type:** [int](https://docs.python.org/3/library/functions.html#int) --- ## pipelines NOTE: These APIs are under heavy development and subject to change. 
## Modules

* [`architectures`](/max/api/python/pipelines/architectures)
* [`config`](/max/api/python/pipelines/config)
* [`core`](/max/api/python/pipelines/core)
* [`hf_pipeline`](/max/api/python/pipelines/hf_pipeline)
* [`hf_utils`](/max/api/python/pipelines/hf_utils)
* [`pipeline`](/max/api/python/pipelines/pipeline)
* [`registry`](/max/api/python/pipelines/registry)
* [`sampling`](/max/api/python/pipelines/sampling)
* [`tokenizer`](/max/api/python/pipelines/tokenizer)

---

## PipelineState

`@register_passable(trivial)`

`struct PipelineState[num_stages: Int]`

Manages state for a multi-stage pipeline with circular buffer semantics.

PipelineState provides a mechanism for tracking the current stage in a multi-stage pipeline, particularly useful for double or triple buffering in GPU tensor operations. It maintains an index that cycles through the available stages, a phase bit that toggles when the index wraps around, and a monotonically increasing count.

This struct is commonly used with TMA operations to coordinate the use of multiple buffers in a pipeline fashion, allowing for overlapping computation and data transfer.

## Parameters

* num\_stages (`Int`): The number of stages in the pipeline (e.g., 2 for double buffering, 3 for triple buffering).

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__() -> Self`

Initialize a PipelineState with default values. Creates a new PipelineState with index 0, phase 0, and count 0.

`__init__(index: Int, phase: Int, count: Int) -> Self`

Initialize a PipelineState with specific values. Creates a new PipelineState with the specified index, phase, and count.

**Args:**

* index (`Int`): The initial stage index.
* phase (`Int`): The initial phase value (0 or 1).
* count (`Int`): The initial count value.

### `index`

`index(self) -> Int`

Get the current stage index.

**Returns:**

The current index value, which ranges from 0 to num\_stages-1.

### `phase`

`phase(self) -> SIMD[uint32, 1]`

Get the current phase bit.

**Returns:**

The current phase value (0 or 1), which toggles when the index wraps around.

### `step`

`step(mut self)`

Advance the pipeline state to the next stage.

Increments the index and count. When the index reaches num\_stages, it wraps around to 0 and toggles the phase bit. This function is used to move to the next buffer in a multi-buffer pipeline, implementing circular buffer semantics.

---

## pmaddubs

`pmaddubs[width: Int](a: SIMD[int32, width], b: SIMD[int32, width]) -> SIMD[int32, width]`

---

## pmaddw

`pmaddw[width: Int](a: SIMD[int32, width], b: SIMD[int32, width]) -> SIMD[int32, width]`

---

## pointer

Implements the Pointer type. You can import these APIs from the `memory` package. For example:

```mojo
from memory import Pointer
```

## Structs

* [`AddressSpace`](/mojo/stdlib/memory/pointer/AddressSpace): Address space of the pointer.
* [`Pointer`](/mojo/stdlib/memory/pointer/Pointer): Defines a non-nullable safe pointer.

---

## Pointer

`@register_passable(trivial)`

`struct Pointer[mut: Bool, //, type: AnyType, origin: Origin[mut], address_space: AddressSpace = AddressSpace(0)]`

Defines a non-nullable safe pointer. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/) in the Mojo Manual.

## Parameters

* mut (`Bool`): Whether the pointee data may be mutated through this.
* type (`AnyType`): Type of the underlying data.
* origin (`Origin[mut]`): The origin of the pointer.
* ​address\_space (`AddressSpace`): The address space of the pointee data. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `Immutable` `alias Immutable = Pointer[type, (muttoimm origin._mlir_origin), address_space]` The immutable version of the `Pointer`. ### `Mutable` `alias Mutable = Pointer[type, (mutcast origin._mlir_origin), address_space]` The mutable version of the `Pointer`. ## Methods ### `__init__` `__init__(*, ref [origin, address_space] to: type) -> Self` Constructs a Pointer from a reference to a value. **Args:** * ​to (`type`): The value to construct a pointer to. ### `__getitem__` `__getitem__(self) -> ref [origin, address_space] type` Enable subscript syntax `ptr[]` to access the element. **Returns:** A reference to the underlying value in memory. ### `__eq__` `__eq__(self, rhs: Pointer[type, origin, address_space]) -> Bool` Returns True if the two pointers are equal. **Args:** * ​rhs (`Pointer[type, origin, address_space]`): The value of the other pointer. **Returns:** True if the two pointers are equal and False otherwise. ### `__ne__` `__ne__(self, rhs: Pointer[type, origin, address_space]) -> Bool` Returns True if the two pointers are not equal. **Args:** * ​rhs (`Pointer[type, origin, address_space]`): The value of the other pointer. **Returns:** True if the two pointers are not equal and False otherwise. ### `address_of` `static address_of(ref [origin, address_space] value: type) -> Self` Constructs a Pointer from a reference to a value. **Args:** * ​value (`type`): The value to get the address of. **Returns:** The result Pointer. ### `copy` `copy(self) -> Self` Constructs a copy from another Pointer. Note that this does **not** copy the underlying data. **Returns:** A copy of the value. ### `get_immutable` `get_immutable(self) -> Pointer[type, (muttoimm origin._mlir_origin), address_space]` Constructs a new Pointer with the same underlying target and an ImmutableOrigin. Notes: This does **not** copy the underlying data. **Returns:** A new Pointer with the same target as self and an ImmutableOrigin. ### `__str__` `__str__(self) -> String` Gets a string representation of the Pointer. **Returns:** The string representation of the Pointer. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[Pointer[type, $1, address_space]]](self) -> Pointer[type, origin, address_space]` Returns a pointer merged with the specified `other_type`. **Parameters:** * ​other\_type (`AnyStruct[Pointer[type, $1, address_space]]`): The type of the pointer to merge with. **Returns:** A pointer merged with the specified `other_type`. --- ## polynomial Provides two implementations for evaluating polynomials. You can import these APIs from the `math` package. For example: ```mojo from math.polynomial import polynomial_evaluate ``` ## Functions * [​`polynomial_evaluate`](/mojo/stdlib/math/polynomial/polynomial_evaluate): Evaluates the polynomial. --- ## polynomial_evaluate `polynomial_evaluate[: Bool, dtype: DType, simd_width: Int, //, coefficients: List[SIMD[dtype, simd_width], $0]](x: SIMD[dtype, simd_width]) -> SIMD[dtype, simd_width]` Evaluates the polynomial. **Parameters:** * ​dtype (`DType`): The dtype of the value. * ​simd\_width (`Int`): The simd\_width of the computed value. * ​coefficients (`List[SIMD[dtype, simd_width], $0]`): The coefficients. **Args:** * ​x (`SIMD[dtype, simd_width]`): The value to compute the polynomial with. 
**Returns:** The polynomial evaluation results using the specified value and the constant coefficients. --- ## pool ## Structs * [​`PoolMethod`](./PoolMethod): ## Functions * [​`avg_pool`](./avg_pool): Computes the average pool. * [​`avg_pool_gpu`](./avg_pool_gpu): Computes the average pool on GPU. * [​`max_pool`](./max_pool): Computes fp32 pooling. * [​`max_pool_gpu`](./max_pool_gpu): Computes max pooling on GPU. * [​`pool_shape`](./pool_shape): * [​`pool_shape_ceil`](./pool_shape_ceil): * [​`pool_shape_impl`](./pool_shape_impl): Compute the output shape of a pooling operation, and assert the inputs are compatible. Works for 2D pool operations only in the NHWC format. --- ## pool_shape `pool_shape[input_rank: Int, input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], filter_buf: NDBuffer[filter_type, 1, origin], strides_buf: NDBuffer[strides_type, 1, origin], dilations_buf: NDBuffer[dilations_type, 1, origin], paddings_buf: NDBuffer[paddings_type, 1, origin]) -> IndexList[input_rank]` --- ## pool_shape_ceil `pool_shape_ceil[input_rank: Int, input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], filter_buf: NDBuffer[filter_type, 1, origin], strides_buf: NDBuffer[strides_type, 1, origin], dilations_buf: NDBuffer[dilations_type, 1, origin], paddings_buf: NDBuffer[paddings_type, 1, origin]) -> IndexList[input_rank]` --- ## pool_shape_impl `pool_shape_impl[input_rank: Int, input_type: DType, filter_type: DType, strides_type: DType, dilations_type: DType, paddings_type: DType, single_thread_blocking_override: Bool, ceil_mode: Bool](input_buf: NDBuffer[input_type, input_rank, origin], filter_buf: NDBuffer[filter_type, 1, origin], strides_buf: NDBuffer[strides_type, 1, origin], dilations_buf: NDBuffer[dilations_type, 1, origin], paddings_buf: NDBuffer[paddings_type, 1, origin]) -> IndexList[input_rank]` Compute the output shape of a pooling operation, and assert the inputs are compatible. Works for 2D pool operations only in the NHWC format. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​input\_type (`DType`): Type of the input tensor. * ​filter\_type (`DType`): Type of the filter tensor. * ​strides\_type (`DType`): Type of the strides tensor. * ​dilations\_type (`DType`): Type of the dilations tensor. * ​paddings\_type (`DType`): Type of the paddings tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​ceil\_mode (`Bool`): Define rounding mode for shape calculation. **Args:** * ​input\_buf (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​filter\_buf (`NDBuffer[filter_type, 1, origin]`): The filter size buffer. * ​strides\_buf (`NDBuffer[strides_type, 1, origin]`): The strides size buffer. * ​dilations\_buf (`NDBuffer[dilations_type, 1, origin]`): The dilations size buffer. * ​paddings\_buf (`NDBuffer[paddings_type, 1, origin]`): The paddings size buffer. **Returns:** The output shape. 
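To make the shape arithmetic concrete, the following is a minimal sketch of the conventional floor-mode pooling output-size formula (ceil mode rounds the division up instead). The helper `pool_output_dim` is hypothetical and not part of the API above; the real `pool_shape_impl` also validates that its inputs are compatible:

```mojo
# Minimal sketch of conventional pooling shape arithmetic (floor mode).
# `pool_output_dim` is a hypothetical helper, not part of the API above.
fn pool_output_dim(
    input_size: Int,
    filter_size: Int,
    stride: Int,
    dilation: Int,
    pad_total: Int,
) -> Int:
    # Effective window extent once dilation is applied.
    var window = dilation * (filter_size - 1) + 1
    return (input_size + pad_total - window) // stride + 1

fn main():
    # A 7-wide spatial dim, 3-wide filter, stride 2, no dilation or padding.
    print(pool_output_dim(7, 3, 2, 1, 0))  # prints 3
```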
---

## PoolMethod

`@register_passable(trivial)`

`struct PoolMethod`

## Fields

* value (`Int`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `AVG`

`alias AVG = PoolMethod(1)`

### `MAX`

`alias MAX = PoolMethod(0)`

## Methods

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

---

## pop_count

`pop_count(val: Int) -> Int`

Counts the number of bits set in an integer value.

**Args:**

* val (`Int`): The input value.

**Returns:**

The number of bits set in the input value.

`pop_count[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]`

Counts the number of bits set in a SIMD vector of integer values.

**Constraints:**

The element type of the input vector must be integral.

**Parameters:**

* dtype (`DType`): `dtype` used for the computation.
* width (`Int`): SIMD width used for the computation.

**Args:**

* val (`SIMD[dtype, width]`): The input value.

**Returns:**

A SIMD value where the element at position `i` contains the number of bits set in the element at position `i` of the input value.

---

## pow

`pow[T: Powable](base: T, exp: T) -> T`

Computes the `base` raised to the power of the `exp`.

**Parameters:**

* T (`Powable`): A type conforming to the `Powable` trait.

**Args:**

* base (`T`): The base of the power operation.
* exp (`T`): The exponent of the power operation.

**Returns:**

The `base` raised to the power of the `exp`.

`pow(base: SIMD[dtype, size], exp: Int) -> SIMD[dtype, size]`

Computes elementwise value of a SIMD vector raised to the power of the given integer.

**Args:**

* base (`SIMD[dtype, size]`): The first input argument.
* exp (`Int`): The second input argument.

**Returns:**

The `base` elementwise raised to the power of `exp`.

---

## Powable

The `Powable` trait describes a type that defines a power operation (i.e. exponentiation) with the same base and exponent types.

Types that conform to `Powable` will work with the builtin `pow` function, which will return the same type as the inputs. For example:

```mojo
struct Rational(Powable):
    var numerator: Float64
    var denominator: Float64

    fn __init__(out self, numerator: Float64, denominator: Float64):
        self.numerator = numerator
        self.denominator = denominator

    fn __pow__(self, exp: Self) -> Self:
        var exp_value = exp.numerator / exp.denominator
        return Self(pow(self.numerator, exp_value), pow(self.denominator, exp_value))
```

You can now use the \*\* operator to exponentiate objects inside generic functions:

```mojo
fn exponentiate[T: Powable](base: T, exp: T) -> T:
    return base ** exp

var base = Rational(Float64(3.0), 5.0)
var exp = Rational(Float64(1.0), 2.0)
var res = exponentiate(base, exp)
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__pow__`

`__pow__(self: _Self, exp: _Self) -> _Self`

Return the value raised to the power of the given exponent.

**Args:**

* exp (`_Self`): The exponent value.

**Returns:**

The value of `self` raised to the power of `exp`.

---

## prefetch

`prefetch[dtype: DType, //, params: PrefetchOptions = PrefetchOptions()](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin])`

Prefetches an instruction or data into cache before it is used.

The prefetch function provides prefetching hints for the target to prefetch instruction or data into cache before they are used.

**Parameters:**

* dtype (`DType`): The DType of value stored in addr.
* params (`PrefetchOptions`): Configuration options for the prefetch intrinsic.

**Args:**

* addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The data pointer to prefetch.

---

## PrefetchCache

`@register_passable(trivial)`

`struct PrefetchCache`

Prefetch cache type.

## Fields

* value (`SIMD[int32, 1]`): The cache prefetch. It should be in \[0, 1].

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `DATA`

`alias DATA = PrefetchCache(1)`

The data prefetching option.

### `INSTRUCTION`

`alias INSTRUCTION = PrefetchCache(0)`

The instruction prefetching option.

## Methods

### `__init__`

`__init__(value: Int) -> Self`

Constructs a prefetch option.

**Args:**

* value (`Int`): An integer value representing the prefetch cache option to be used. Should be a value in the range `[0, 1]`.

---

## PrefetchLocality

`@register_passable(trivial)`

`struct PrefetchLocality`

The prefetch locality. The locality, rw, and cache type correspond to LLVM prefetch intrinsic's inputs (see [LLVM prefetch locality](https://llvm.org/docs/LangRef.html#llvm-prefetch-intrinsic))

## Fields

* value (`SIMD[int32, 1]`): The prefetch locality to use. It should be a value in \[0, 3].

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `HIGH`

`alias HIGH = PrefetchLocality(3)`

Extremely local locality (keep in cache).

### `LOW`

`alias LOW = PrefetchLocality(1)`

Low locality.

### `MEDIUM`

`alias MEDIUM = PrefetchLocality(2)`

Medium locality.

### `NONE`

`alias NONE = PrefetchLocality(0)`

No locality.

## Methods

### `__init__`

`__init__(value: Int) -> Self`

Constructs a prefetch locality option.

**Args:**

* value (`Int`): An integer value representing the locality. Should be a value in the range `[0, 3]`.

---

## PrefetchOptions

`@register_passable(trivial)`

`struct PrefetchOptions`

Collection of configuration parameters for a prefetch intrinsic call.

The op configuration follows similar interface as LLVM intrinsic prefetch op, with a "locality" attribute that specifies the level of temporal locality in the application, that is, how soon would the same data be visited again. Possible locality values are: `NONE`, `LOW`, `MEDIUM`, and `HIGH`.

The op also takes a "cache tag" attribute giving hints on how the prefetched data will be used. Possible tags are: `ReadICache`, `ReadDCache` and `WriteDCache`.

Note: the actual behavior of the prefetch op and concrete interpretation of these attributes are target-dependent.

## Fields

* rw (`PrefetchRW`): Indicates prefetching for read or write.
* locality (`PrefetchLocality`): Indicates locality level.
* cache (`PrefetchCache`): Indicates i-cache or d-cache prefetching.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__() -> Self`

Constructs an instance of PrefetchOptions with default params.

### `for_read`

`for_read(self) -> Self`

Sets the prefetch purpose to read.

**Returns:**

The updated prefetch parameter.

### `for_write`

`for_write(self) -> Self`

Sets the prefetch purpose to write.

**Returns:**

The updated prefetch parameter.

### `no_locality`

`no_locality(self) -> Self`

Sets the prefetch locality to none.

**Returns:**

The updated prefetch parameter.

### `low_locality`

`low_locality(self) -> Self`

Sets the prefetch locality to low.

**Returns:**

The updated prefetch parameter.

### `medium_locality`

`medium_locality(self) -> Self`

Sets the prefetch locality to medium.
**Returns:** The updated prefetch parameter. ### `high_locality` `high_locality(self) -> Self` Sets the prefetch locality to high. **Returns:** The updated prefetch parameter. ### `to_data_cache` `to_data_cache(self) -> Self` Sets the prefetch target to data cache. **Returns:** The updated prefetch parameter. ### `to_instruction_cache` `to_instruction_cache(self) -> Self` Sets the prefetch target to instruction cache. **Returns:** The updated prefetch parameter. --- ## PrefetchRW `@register_passable(trivial)` `struct PrefetchRW` Prefetch read or write. ## Fields * ​value (`SIMD[int32, 1]`): The read-write prefetch. It should be in \[0, 1]. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `READ` `alias READ = PrefetchRW(0)` Read prefetch. ### `WRITE` `alias WRITE = PrefetchRW(1)` Write prefetch. ## Methods ### `__init__` `__init__(value: Int) -> Self` Constructs a prefetch read-write option. **Args:** * ​value (`Int`): An integer value representing the prefetch read-write option to be used. Should be a value in the range `[0, 1]`. --- ## Prefill Prefill is the first phase of an AI model's forward pass in which the model processes the input and initializes a cache to accelerate predictions. Different model architectures may have their own version of a prefill, but it's primarily associated with large language models (LLMs), in which case it's also called [context encoding](context-encoding.mdx). --- ## Prefix caching with PagedAttention Prefix caching is a technique that caches the key-value (KV) cache of existing inference requests so that new queries can reuse the context encoded in the KV cache if they share the same prefix. This eliminates redundant computations and improves performance for workloads with repeated prefixes. By default, prefix caching is disabled in MAX. It can be enabled using the `--enable-prefix-caching` flag. :::note Prefix caching with MAX is still in preview and some aspects may change as we refine the implementation. Expect ongoing improvements and potential adjustments based on feedback and performance optimizations. ::: ## When to use prefix caching Prefix caching speeds up the pre-fill stage of inference, which reduces time to first token (TTFT). It can also reduce memory usage within the KV cache for all requests, which makes room for scheduling larger batches and yielding higher throughput. Prefix caching can provide significant performance improvements in the following scenarios: 1. **Similar queries**: When a user repeatedly makes similar queries that use the same system prompt instructions, the KV cache of the prefix can be stored in advance to reduce redundant computation. 2. **Multi-round conversations**: In chat applications, users often ask follow-up queries related to previous inputs. Since the server releases KV cache memory after each request, prefix caching preserves computation from past conversation turns without requiring an explicit session. Prefix caching won't result in performance degradation. However, it also does not provide additional benefit in the following cases: - **Unique queries**: If new queries do not share prefixes with previous queries, there is no opportunity to reuse cached KV values, making prefix caching ineffective. - **Long response generation**: Prefix caching only speeds up the pre-fill phase of a request. If most of the time is spent generating new tokens (decoding), caching will have little impact. 
## How prefix caching works

Prefix caching works by storing the key-value (KV) cache for a prefix and applying it to future prompts that include the same prefix, reducing redundant computation.

You must specify all of the following to use prefix caching with the `max` CLI:

- `--cache-strategy`: Prefix caching requires PagedAttention. To use PagedAttention, set your cache strategy to `paged`.
- `--enable-prefix-caching`: Enables prefix caching.
- `--kv-cache-page-size`: PagedAttention currently requires a page size that is a multiple of 128.

Prefix caching with PagedAttention works on both CPU and GPU. To deploy a model with prefix caching using the `max` CLI, you can use the flag `--devices cpu` for CPU or `--devices gpu` for GPU workloads. If no flag is provided, the model runs on the first available GPU, or on the first available CPU if no GPUs are available.

## Quickstart

You can enable prefix caching when serving your model with the [`max` CLI](/max/max-cli#serve). To install the `max` CLI, see the [installation guide](/max/packages).

```
max serve --model-path modularai/Llama-3.1-8B-Instruct-GGUF \
  --cache-strategy paged \
  --enable-prefix-caching \
  --kv-cache-page-size 128 \
  --quantization-encoding float32
```

:::note

Paged KV caching does not support quantized encodings. It may take some time to download the `float32` weights. For more information about encoding options in MAX, see [Quantization](/max/graph/quantize).

:::

## Next steps

Now that you know the basics of prefix caching and PagedAttention, you can get started with MAX on GPUs. MAX also includes a benchmarking script that allows you to evaluate throughput, latency, and GPU utilization metrics. You can use this script to track performance gains from prefix caching. For more detailed instructions on benchmarking, please see [Benchmark MAX](https://github.com/modular/modular/tree/main/benchmark).

- [Deploy Llama 3 on GPU with MAX](/max/tutorials/max-serve-local-to-cloud): Learn how to deploy an LLM to the cloud on GPU.
- [Deploy Llama 3.1 on GPU-powered Kubernetes clusters](/max/tutorials/deploy-max-serve-on-kubernetes): Learn how to deploy Llama 3.1 using Kubernetes, MAX, and NVIDIA GPUs.

---

## prefix_product

`prefix_product(a: IntTuple[origin]) -> IntTuple`

Compute the exclusive prefix product of an `IntTuple`. This is a convenience wrapper that initializes the prefix product with 1.

**Args:**

* a (`IntTuple[origin]`): The input `IntTuple` to compute the prefix product for.

**Returns:**

A new `IntTuple` containing the exclusive prefix product of the input.

`prefix_product(a: IntTuple[origin], init: Int) -> IntTuple`

Compute the exclusive prefix product of an `IntTuple` with an initial value. This function delegates to the implementation in prefix\_product2.

**Args:**

* a (`IntTuple[origin]`): The input `IntTuple` to compute the prefix product for.
* init (`Int`): The initial value(s) for the prefix product, defaults to 1.

**Returns:**

A new `IntTuple` containing the exclusive prefix product of the input.

---

## prefix_product

`prefix_product[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> RuntimeTuple[prefix_product[::Origin[::Bool(t)]`

Computes the prefix products of elements in the `RuntimeTuple`. This function calculates the running product of elements, where each output element is the product of all previous elements in the input.
This is commonly used in tensor computations to calculate stride values. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the input RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The input `RuntimeTuple`. **Returns:** A new `RuntimeTuple` containing the prefix products of the input elements. --- ## prefix_sum `prefix_sum[type: DType, //, *, block_size: Int, exclusive: Bool = False](val: SIMD[type, 1]) -> SIMD[type, 1]` Performs a prefix sum (scan) operation across all threads in a block. This function implements a block-level inclusive or exclusive scan, efficiently computing the cumulative sum for each thread based on thread indices. **Parameters:** * ​type (`DType`): The data type of the Scalar elements. * ​block\_size (`Int`): The total number of threads in the block. * ​exclusive (`Bool`): If True, perform exclusive scan instead of inclusive. **Args:** * ​val (`SIMD[type, 1]`): The Scalar value from each thread to include in the scan. **Returns:** A Scalar value containing the result of the scan operation for each thread. --- ## prefix_sum `prefix_sum[type: DType, //, intermediate_type: DType = type, *, output_type: DType = type, exclusive: Bool = False](x: SIMD[type, 1]) -> SIMD[output_type, 1]` Computes a warp-level prefix sum (scan) operation. Performs an inclusive or exclusive prefix sum across threads in a warp using a parallel scan algorithm with warp shuffle operations. This implements an efficient parallel scan with logarithmic complexity. For example, if we have a warp with the following elements: $$ [x_0, x_1, x_2, x_3, x_4] $$ The prefix sum is: $$ [x_0, x_0 + x_1, x_0 + x_1 + x_2, x_0 + x_1 + x_2 + x_3, x_0 + x_1 + x_2 + x_3 + x_4] $$ **Parameters:** * ​type (`DType`): The data type of the input SIMD elements. * ​intermediate\_type (`DType`): Type used for intermediate calculations (defaults to input type). * ​output\_type (`DType`): The desired output data type (defaults to input type). * ​exclusive (`Bool`): If True, performs exclusive scan where each thread receives the sum of all previous threads. If False (default), performs inclusive scan where each thread receives the sum including its own value. **Args:** * ​x (`SIMD[type, 1]`): The SIMD value to include in the prefix sum. **Returns:** A scalar containing the prefix sum at the current thread's position in the warp, cast to the specified output type. --- ## prev_power_of_two `prev_power_of_two(val: Int) -> Int` Computes the largest power of 2 that is less than or equal to the input value. Any integral value less than or equal to 0 will be floored to 0. This operation is called `bit_floor()` in C++. **Args:** * ​val (`Int`): The input value. **Returns:** The largest power of 2 that is less than or equal to the input value. `prev_power_of_two[dtype: DType, width: Int, //](val: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the largest power of 2 that is less than or equal to the input value for each element of a SIMD vector. Any integral value less than or equal to 0 will be floored to 0. This operation is called `bit_floor()` in C++. **Constraints:** The element type of the input vector must be integral. **Parameters:** * ​dtype (`DType`): `dtype` used for the computation. * ​width (`Int`): SIMD width used for the computation. **Args:** * ​val (`SIMD[dtype, width]`): The input value. **Returns:** A SIMD value where the element at position `i` is the largest power of 2 that is less than or equal to the integer at position `i` of the input value. 
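To make the floor-to-power-of-2 behavior concrete, here is a minimal sketch of both overloads, assuming `prev_power_of_two` is exported from the standard library `bit` module:

```mojo
from bit import prev_power_of_two

def main():
    print(prev_power_of_two(20))  # 16: the largest power of 2 <= 20
    print(prev_power_of_two(16))  # 16: exact powers of 2 are unchanged
    print(prev_power_of_two(0))   # 0: non-positive inputs floor to 0
    # SIMD overload: applied elementwise to an integral vector.
    print(prev_power_of_two(SIMD[DType.int32, 4](3, 5, 9, 31)))  # [2, 4, 8, 16]
```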
---

## print

`print[*Ts: Writable](*values: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" "), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("\n"), flush: Bool = False, owned file: FileDescriptor = FileDescriptor(1))`

Prints elements to the text stream. Each element is separated by `sep` and followed by `end`.

**Parameters:**

* \*Ts (`Writable`): The element types.

**Args:**

* \*values (`*Ts`): The elements to print.
* sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements.
* end (`StringSlice[StaticConstantOrigin]`): The string to write after printing the elements.
* flush (`Bool`): If set to true, then the stream is forcibly flushed.
* file (`FileDescriptor`): The output stream.

---

## print_kv_cache_cont_batch_generic_cpu

`print_kv_cache_cont_batch_generic_cpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: ContinuousBatchingKVCacheCollection[type, kv_params], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)`

---

## print_kv_cache_cont_batch_generic_gpu

`print_kv_cache_cont_batch_generic_gpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: ContinuousBatchingKVCacheCollection[type, kv_params], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)`

---

## print_kv_cache_paged_generic_cpu

`print_kv_cache_paged_generic_cpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams, page_size: Int, assert_write_mode: Int = 0](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: PagedKVCacheCollection[type, kv_params, page_size, assert_write_mode], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)`

---

## print_kv_cache_paged_generic_gpu

`print_kv_cache_paged_generic_gpu[target: StringSlice[StaticConstantOrigin], type: DType, kv_params: KVCacheStaticParams, page_size: Int, assert_write_mode: Int = 0](valid_lengths: NDBuffer[uint32, 1, origin], kv_collection: PagedKVCacheCollection[type, kv_params, page_size, assert_write_mode], layer_idx: SIMD[uint32, 1], is_print_compact: Bool, context: DeviceContextPtr)`

---

## print_layout

`print_layout(layout: Layout)`

Prints a 2D layout to the standard output. This function visualizes a 2D layout by printing a formatted table showing the memory indices for each logical coordinate.

**Args:**

* layout (`Layout`): The 2D layout to print.

---

## prod_dims

`prod_dims[start_dim: Int, end_dim: Int](x: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> Int`

Computes the product of a slice of the given buffer's dimensions.

**Parameters:**

* start\_dim (`Int`): The index at which to begin computing the product.
* end\_dim (`Int`): The index at which to stop computing the product.

**Args:**

* x (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The NDBuffer whose dimensions will be multiplied.

**Returns:** The product of the specified slice of the buffer's dimensions.
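As a concrete illustration, the following sketch stack-allocates a statically shaped 2x3x4 buffer and takes dimension products over two half-open ranges. The import paths and the `stack_allocation()` constructor are assumptions based on the surrounding `buffer` APIs, not a verified recipe:

```mojo
from buffer import NDBuffer
from buffer.buffer import prod_dims  # assumed to live beside NDBuffer
from buffer.dimlist import DimList

def main():
    # A statically shaped 2x3x4 buffer allocated on the stack.
    var buf = NDBuffer[
        DType.float32, 3, MutableAnyOrigin, DimList(2, 3, 4)
    ].stack_allocation()
    print(prod_dims[0, 2](buf))  # dims [0, 2) -> 2 * 3 = 6
    print(prod_dims[1, 3](buf))  # dims [1, 3) -> 3 * 4 = 12
```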
--- ## producer_main_loop `producer_main_loop[a_type: DType, b_type: DType, a_tile_layout: Layout, b_tile_layout: Layout, a_smem_layout: Layout, b_smem_layout: Layout, a_desc_layout: Layout, b_desc_layout: Layout, pipeline_stages: Int, /, *, block_tile_shape: IndexList[3], cluster_shape: StaticTuple[SIMD[int32, 1], 3] = StaticTuple(__init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1), __init__[__mlir_type.!pop.int_literal](1)), partitioned_multicast: Bool = False](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], a_smem_iter: LayoutTensorIter[a_type, a_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], b_smem_iter: LayoutTensorIter[b_type, b_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128, circular=circular, axis=axis, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked], num_k_iters: Int, m_coord: UInt, n_coord: UInt, rank_n: UInt, rank_m: UInt, mut write_pipeline_states: PipelineState[pipeline_stages], empty_mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8], full_mbar: UnsafePointer[SharedMemBarrier, address_space=AddressSpace(3), alignment=8])` --- ## product `product(t: IntTuple[origin]) -> Int` Calculate the product of all values in an `IntTuple`. This function recursively computes the product of all integer values in a potentially nested `IntTuple` structure. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to multiply. **Returns:** The product of all integer values, or `UNKNOWN_VALUE` if any value in the tuple is `UNKNOWN_VALUE`. --- ## product `product[: ImmutableOrigin, //, t: IntTuple[$0]](tuple: RuntimeTuple[t, element_type=element_type]) -> Int` Computes the product of all elements in the `RuntimeTuple`. This function multiplies all scalar values in the tuple, including those in nested tuples after flattening. This is commonly used to calculate the total size of a tensor from its shape. **Parameters:** * ​t (`IntTuple[$0]`): The IntTuple type parameter of the input RuntimeTuple. **Args:** * ​tuple (`RuntimeTuple[t, element_type=element_type]`): The input `RuntimeTuple`. **Returns:** The product of all scalar elements in the tuple. --- ## product `product(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the product of the buffer elements. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The product of the buffer elements. `product[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the product across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. 
`product[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())`

Computes a product reduction over the given input domain. This performs the product computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`.

**Parameters:**

* type (`DType`): The type of the input and output.
* input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input.
* output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output.
* single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread.
* target (`StringSlice[StaticConstantOrigin]`): The target to run on.

**Args:**

* input\_shape (`IndexList[size]`): The input shape.
* reduce\_dim (`Int`): The axis to perform the product on.
* context (`DeviceContextPtr`): The pointer to DeviceContext.

---

## product

`product[size: Int](tuple: IndexList[size, element_type=element_type], end_idx: Int = size) -> Int`

Computes the product of values in the tuple up to the given index.

**Parameters:**

* size (`Int`): The tuple size.

**Args:**

* tuple (`IndexList[size, element_type=element_type]`): The tuple to get a product of.
* end\_idx (`Int`): The end index.

**Returns:** The product of all tuple elements in the given range.

`product[size: Int](tuple: IndexList[size, element_type=element_type], start_idx: Int, end_idx: Int) -> Int`

Computes the product of values in the tuple in the given index range.

**Parameters:**

* size (`Int`): The tuple size.

**Args:**

* tuple (`IndexList[size, element_type=element_type]`): The tuple to get a product of.
* start\_idx (`Int`): The start index of the range.
* end\_idx (`Int`): The end index of the range.

**Returns:** The product of all tuple elements in the given range.

---

## product_each

`product_each(t: IntTuple[origin]) -> IntTuple`

Compute the product of elements in each sub-tuple of an `IntTuple`. For each immediate child of the input tuple, this function computes the product of all elements within that child.

**Args:**

* t (`IntTuple[origin]`): The `IntTuple` containing sub-tuples.

**Returns:** A new `IntTuple` where each element is the product of the corresponding sub-tuple in the input.

---

## ProfileBlock

`struct ProfileBlock[enabled: Bool = False]`

A struct for profiling code blocks. This struct provides context manager functionality to profile code blocks. When enabled, it records the start and end time of the block and prints the timing information.

## Parameters

* enabled (`Bool`): Whether profiling is enabled for this block.

## Fields

* name (`StringSlice[StaticConstantOrigin]`): Name of the profiling block used for identification in timing output.
* loc (`_SourceLocation`): Source code location information for the profiling block, including file, line, and column.
* start\_time (`UInt`): Start time of the profiling block in nanoseconds, captured using `perf_counter_ns()`.
## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit`
`__init__(out self, name: StringSlice[StaticConstantOrigin])`

Initialize a new ProfileBlock.

**Args:**

* name (`StringSlice[StaticConstantOrigin]`): Name to identify this profiling block.

### `__enter__`

`__enter__(mut self)`

Enter the profiling block and record start time if enabled.

### `__exit__`

`__exit__(mut self)`

Exit the profiling block, record end time and print timing if enabled.

---

## profiler

This module provides GPU profiling functionality. The profiler module enables performance profiling of GPU code blocks through a simple context manager interface. It includes:

* ProfileBlock: A context manager for timing code blocks
* Configurable profiling that can be enabled/disabled at compile time
* Nanosecond precision timing using `perf_counter_ns()`
* Source location tracking for profiled blocks
* Formatted timing output

Example:

```mojo
from gpu import profiler

with profiler.ProfileBlock("my_kernel"):
    # Code to profile
    run_gpu_kernel()
```

## Structs

* [`ProfileBlock`](/mojo/stdlib/gpu/profiler/ProfileBlock): A struct for profiling code blocks.

---

## promote_to_cuda_cores

`promote_to_cuda_cores[accum_type: DType, layout: Layout](c_reg_tile: LayoutTensor[accum_type, layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], final_c_reg_tile: LayoutTensor[accum_type, layout, MutableAnyOrigin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

---

## propagate_unknown

`propagate_unknown(src: IntTuple[origin], target: IntTuple[origin]) -> IntTuple`

Propagates unknown dimensions from the target `IntTuple` to the source `IntTuple`. This function creates a new `IntTuple` by combining the source and target `IntTuple`s, preserving unknown dimensions (UNKNOWN\_VALUE) from the target while using values from the source for known dimensions.

**Args:**

* src (`IntTuple[origin]`): The source `IntTuple` containing known dimension values.
* target (`IntTuple[origin]`): The target `IntTuple` that may contain unknown dimensions (UNKNOWN\_VALUE).

**Returns:** A new `IntTuple` with unknown dimensions from target and known dimensions from src.

---

## pwd

Provides access to user and group information from the password database. Use the [`Passwd`](/mojo/stdlib/pwd/pwd/Passwd) type to access user account information such as user name, ID, group, and home directory.

## Modules

* [`pwd`](/mojo/stdlib/pwd/pwd/):

---

## pwd

## Structs

* [`Passwd`](/mojo/stdlib/pwd/pwd/Passwd): Represents user account information retrieved from the user password database related to a user ID.

## Functions

* [`getpwnam`](/mojo/stdlib/pwd/pwd/getpwnam): Retrieves the user ID in the password database for the given user name.
* [`getpwuid`](/mojo/stdlib/pwd/pwd/getpwuid): Retrieve the password database entry for a given user ID.

---

## PyMojoObject

`struct PyMojoObject[T: AnyType]`

Storage backing a PyObject\* wrapping a Mojo value. This struct represents the C-level layout of a Python object that contains a wrapped Mojo value. It must be ABI-compatible with CPython's PyObject structure to enable seamless interoperability between Mojo and Python.
The struct follows Python's object model where all Python objects begin with a PyObject header (`ob_base`), followed by type-specific data. In this case, the type-specific data is a Mojo value of type T.

## Parameters

* T (`AnyType`): The Mojo type being wrapped. Can be any type that satisfies `AnyType`.

## Fields

* ob\_base (`PyObject`): The standard Python object header containing reference count and type information. This must be the first field to maintain ABI compatibility with Python's object layout. All Python objects begin with this header structure.
* mojo\_value (`T`): The actual Mojo value being wrapped and exposed to Python. This field stores the Mojo data that Python code can interact with through the registered type methods and bindings.

## Implemented traits

`AnyType`, `UnknownDestructibility`

---

## python

Implements the python package.

## Modules

* [`bindings`](/mojo/stdlib/python/bindings/):
* [`python`](/mojo/stdlib/python/python/): Implements Python interoperability.
* [`python_object`](/mojo/stdlib/python/python_object/): Implements PythonObject.

---

## python

Implements Python interoperability. You can import these APIs from the `python` package. For example:

```mojo
from python import Python
```

## Structs

* [`Python`](/mojo/stdlib/python/python/Python): Provides methods that help you use Python code in Mojo.

---

## Python

`struct Python`

Provides methods that help you use Python code in Mojo.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self)`

Default constructor.

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Copy constructor.

**Args:**

* existing (`Self`): The existing instance to copy from.

### `cpython`

`cpython(self) -> ref [StaticConstantOrigin] CPython`

Handle to the low-level C API of the CPython interpreter present in the current process.

**Returns:** Handle to the CPython interpreter instance in the current process.

### `eval`

`eval(self, owned code: String) -> Bool`

Executes the given Python code.

**Args:**

* code (`String`): The Python code to execute.

**Returns:** `True` if the code executed successfully or `False` if the code raised an exception.

### `evaluate`

`static evaluate(owned expr: String, file: Bool = False, name: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("__main__")) -> PythonObject`

Executes the given Python code.

**Args:**

* expr (`String`): The Python expression to evaluate.
* file (`Bool`): Evaluate as a file and return the module.
* name (`StringSlice[StaticConstantOrigin]`): The name of the module (most relevant if `file` is True).

**Returns:** `PythonObject` containing the result of the evaluation.

### `add_to_path`

`static add_to_path(dir_path: StringSlice[origin])`

Adds a directory to the Python path. This might be necessary to import a Python module via `import_module()`. For example:

```mojo
from python import Python

# Specify path to `mypython.py` module
Python.add_to_path("path/to/module")
var mypython = Python.import_module("mypython")

var c = mypython.my_algorithm(2, 3)
```

**Args:**

* dir\_path (`StringSlice[origin]`): The path to a Python module you want to import.

### `import_module`

`static import_module(owned module: String) -> PythonObject`

Imports a Python module. This provides you with a module object you can use just like you would in Python.
For example:

```mojo
from python import Python

# This is equivalent to Python's `import numpy as np`
np = Python.import_module("numpy")
a = np.array([1, 2, 3])
```

**Args:**

* module (`String`): The Python module name. This module must be visible from the list of available Python paths (you might need to add the module's path with `add_to_path()`).

**Returns:** The Python module.

### `create_module`

`static create_module(name: StringSlice[StaticConstantOrigin]) -> TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")]`

Creates a Python module using the provided name.

**Args:**

* name (`StringSlice[StaticConstantOrigin]`): The Python module name.

**Returns:** The Python module.

### `add_functions`

`static add_functions(module: TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")], owned functions: List[PyMethodDef])`

Adds functions to a PythonModule object.

**Args:**

* module (`TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")]`): The PythonModule object.
* functions (`List[PyMethodDef]`): List of function data.

**Raises:** If we fail to add the functions to the module.

### `add_object`

`static add_object(module: TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")], owned name: String, value: PythonObject)`

Add a new object to `module` with the given name and value. The provided object can be any type of Python object: an instance, a type object, a function, etc. The added value will be inserted into the `__dict__` of the provided module.

**Args:**

* module (`TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")]`): The Python module to modify.
* name (`String`): The name of the new object.
* value (`PythonObject`): The python object value.

### `dict`

`static dict[V: PythonConvertible & Copyable & Movable = PythonObject](*, owned **kwargs: V) -> PythonObject`

Construct a Python dictionary from keyword arguments.

**Parameters:**

* V (`PythonConvertible & Copyable & Movable`): The type of the values in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits.

**Args:**

* \*\*kwargs (`V`): The keyword arguments to construct the dictionary with.

**Returns:** The constructed Python dictionary.

**Raises:** On failure to construct the dictionary or convert the values to Python objects.

`static dict[K: PythonConvertible & Copyable & Movable = PythonObject, V: PythonConvertible & Copyable & Movable = PythonObject](tuples: Span[Tuple[K, V], origin]) -> PythonObject`

Construct a Python dictionary from a list of key-value tuples.

**Parameters:**

* K (`PythonConvertible & Copyable & Movable`): The type of the keys in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits.
* V (`PythonConvertible & Copyable & Movable`): The type of the values in the dictionary. Must implement the `PythonConvertible`, `Copyable`, and `Movable` traits.

**Args:**

* tuples (`Span[Tuple[K, V], origin]`): The list of key-value tuples to construct the dictionary with.

**Returns:** The constructed Python dictionary.

**Raises:** On failure to construct the dictionary or convert the keys or values to Python objects.

### `list`

`static list[T: PythonConvertible & Copyable & Movable](values: Span[T, origin]) -> PythonObject`

Initialize the object from a list of values.

**Parameters:**

* T (`PythonConvertible & Copyable & Movable`): The span element type.
**Args:**

* values (`Span[T, origin]`): The values to initialize the list with.

**Returns:** A PythonObject representing the list.

`static list[*Ts: PythonConvertible](*values: *Ts) -> PythonObject`

Construct a Python list of objects.

**Parameters:**

* \*Ts (`PythonConvertible`): The list element types.

**Args:**

* \*values (`*Ts`): The values to initialize the list with.

**Returns:** The constructed Python list.

### `tuple`

`static tuple[*Ts: PythonConvertible](*values: *Ts) -> PythonObject`

Construct a Python tuple of objects.

**Parameters:**

* \*Ts (`PythonConvertible`): The tuple element types.

**Args:**

* \*values (`*Ts`): The values to initialize the tuple with.

**Returns:** The constructed Python tuple.

### `as_string_slice`

`as_string_slice(self, str_obj: PythonObject) -> StringSlice[MutableAnyOrigin]`

Return a string representing the given Python object.

**Args:**

* str\_obj (`PythonObject`): The Python object.

**Returns:** Mojo string representing the given Python object.

### `type`

`static type(obj: PythonObject) -> PythonObject`

Return the type of this PythonObject.

**Args:**

* obj (`PythonObject`): The PythonObject we want the type of.

**Returns:** A PythonObject that holds the type object.

### `none`

`static none() -> PythonObject`

Get a `PythonObject` representing `None`.

**Returns:** `PythonObject` representing `None`.

### `str`

`static str(obj: PythonObject) -> PythonObject`

Convert a PythonObject to a Python `str`.

**Args:**

* obj (`PythonObject`): The PythonObject to convert.

**Returns:** A Python `str` object.

**Raises:** An error if the conversion failed.

### `int`

`static int(obj: PythonObject) -> PythonObject`

Convert a PythonObject to a Python `int` (i.e. arbitrary precision integer).

**Args:**

* obj (`PythonObject`): The PythonObject to convert.

**Returns:** A PythonObject representing the result of the conversion to `int`.

**Raises:** If the conversion to `int` fails.

### `float`

`static float(obj: PythonObject) -> PythonObject`

Convert a PythonObject to a Python `float` object.

**Args:**

* obj (`PythonObject`): The PythonObject to convert.

**Returns:** A Python `float` object.

**Raises:** If the conversion fails.

### `py_long_as_ssize_t`

`static py_long_as_ssize_t(obj: PythonObject) -> Int`

Get the value of a Python `long` object.

**Args:**

* obj (`PythonObject`): The Python `long` object.

**Returns:** The value of the `long` object as a `Py_ssize_t`.

**Raises:** If `obj` is not a Python `long` object, or if the `long` object value overflows `Py_ssize_t`.

### `is_true`

`static is_true(obj: PythonObject) -> Bool`

Check if the PythonObject is truthy.

**Args:**

* obj (`PythonObject`): The PythonObject to check.

**Returns:** True if the PythonObject is truthy and False otherwise.

**Raises:** If the boolean value of the PythonObject cannot be determined.

---

## Python interoperability

Because Mojo uses a Pythonic syntax, it's easy to start reading and writing Mojo when coming from Python. Mojo also optimizes for ease of use across the Python-Mojo language boundary, with built-in support for both calling **into Python** from Mojo, and calling **into Mojo** from Python.

The common API for interoperability in both directions is the [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) type, which wraps a Python object within Mojo. In Mojo, you can import Python modules, construct Python objects, and call Python functions and methods directly.
Mojo will first load the CPython interpreter as a dynamic library (called `libpython.dylib` on macOS), and use that interpreter to execute Python code. For example:

```mojo title="🔥 Mojo"
from python import Python

fn main():
    # Loads CPython dynamically behind the scenes; returns a PythonObject
    var res = Python.evaluate("2 + 2")
```

Calling into Mojo from Python is different. Because Mojo is a compiled language, we can't directly "evaluate" Mojo code. Instead, Mojo code must declare up front which functions and types are available to be called from Python. For example:

```mojo title="🔥 mojo_module.mojo"
@export
fn PyInit_mojo_module() -> PythonObject:
    try:
        var m = PythonModuleBuilder("mojo_module")
        m.def_function[mojo_greet]("mojo_greet", docstring="Say hello from Mojo")
        return m.finalize()
    except e:
        return abort[PythonObject](String("error creating Python Mojo module:", e))

fn mojo_greet(name: PythonObject):
    print("Hello to", name, "from Mojo 👋")
```

By defining a suitable `PyInit_*()` function, you let Mojo perform the necessary low-level binding calls that inform Python how to call Mojo code:

```python title="🐍 main.py"
import max._mojo.mojo_importer
import mojo_module

mojo_module.mojo_greet("Python")
```

(Although it's not quite that simple yet.)

These quick examples give you a taste of what interoperability looks like for Python and Mojo. Flexible interop enables you to adopt Mojo incrementally and efficiently. By embracing both directions of language interop, you can choose how to use Mojo in a way that works best for your use case.

**To learn more about bridging Python ↔ Mojo, continue reading** with the companion guides on calling Python from Mojo and calling Mojo from Python.

---

## Python types

When calling Python methods, Mojo needs to convert back and forth between native Python objects and native Mojo objects. Most of these conversions happen automatically, but there are a number of cases that Mojo doesn't handle yet. In these cases you may need to do an explicit conversion, or call an extra method.

## Mojo types in Python

Mojo primitive types implicitly convert into Python objects. Today we support integers, floats, booleans, and strings. To demonstrate, the following example dynamically creates an in-memory Python module named `py_utils` containing a `type_printer()` function, which simply prints the type of a given value. Then you can see how different Mojo values convert into corresponding Python types.

```mojo
from python import Python

def main():
    py_module = """
def type_printer(value):
    print(type(value))
"""
    py_utils = Python.evaluate(py_module, file=True, name="py_utils")

    py_utils.type_printer(4)
    py_utils.type_printer(3.14)
    py_utils.type_printer(True)
    py_utils.type_printer("Mojo")
```

```output
<class 'int'>
<class 'float'>
<class 'bool'>
<class 'str'>
```

## Python types in Mojo

You can also create and use Python objects from Mojo.

### Mojo wrapper objects

When you use Python objects in your Mojo code, Mojo adds the [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject) wrapper around the Python object. This object exposes a number of common double underscore methods (dunder methods) like `__getitem__()` and `__getattr__()`, passing them through to the underlying Python object. Most of the time, you can treat the wrapped object just like you'd treat it in Python. You can use dot-notation to access attributes and call methods, and use the `[]` operator to access an item in a sequence.
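For instance, the following sketch shows attribute access, method calls, and `[]` indexing all passing through the wrapper; it assumes NumPy is installed in the Python environment that Mojo loads:

```mojo
from python import Python

def main():
    # Assumes NumPy is available in the active Python environment.
    np = Python.import_module("numpy")
    arr = np.arange(6).reshape(2, 3)  # method calls forward to the Python object
    print(arr.shape)                  # attribute access via __getattr__
    print(arr[1])                     # item access via __getitem__
```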
You can explicitly create a wrapped Python object by initializing a `PythonObject` with a Mojo integer, float, boolean, or string. Additionally, you can create several types of Python collections directly in Mojo using the [`Python.dict()`](/mojo/stdlib/python/python/Python#dict), [`Python.list()`](/mojo/stdlib/python/python/Python#list), and [`Python.tuple()`](/mojo/stdlib/python/python/Python#tuple) static methods.

For example, to create a Python dictionary, use the [`Python.dict()`](/mojo/stdlib/python/python/Python#dict) method:

```mojo
from python import Python

def main():
    py_dict = Python.dict()
    py_dict["item_name"] = "whizbang"
    py_dict["price"] = 11.75
    py_dict["inventory"] = 100
    print(py_dict)
```

```output
{'item_name': 'whizbang', 'price': 11.75, 'inventory': 100}
```

With the [`Python.list()`](/mojo/stdlib/python/python/Python#list) method, you can create a Python list and optionally initialize it:

```mojo
from python import Python

def main():
    py_list = Python.list("cat", 2, 3.14159, 4)
    n = py_list[2]
    print("n =", n)
    py_list.append(5)
    py_list[0] = "aardvark"
    print(py_list)
```

```output
n = 3.14159
['aardvark', 2, 3.14159, 4, 5]
```

The [`Python.tuple()`](/mojo/stdlib/python/python/Python#tuple) method creates a Python tuple of values:

```mojo
from python import Python

def main():
    py_tuple = Python.tuple("cat", 2, 3.1415, "cat")
    n = py_tuple[2]
    print("n =", n)
    print("Number of cats:", py_tuple.count("cat"))
```

```output
n = 3.1415
Number of cats: 2
```

If you want to construct a Python type that doesn't have a literal Mojo equivalent, you can also use the [`Python.evaluate()`](/mojo/stdlib/python/python/Python#evaluate) method. For example, to create a Python `set`:

```mojo
from python import Python

def main():
    var py_set = Python.evaluate('{2, 3, 2, 7, 11, 3}')
    num_items = len(py_set)
    print(num_items, "items in the set.")
    contained = 7 in py_set
    print("Is 7 in the set:", contained)
```

```output
4 items in the set.
Is 7 in the set: True
```

Some Mojo APIs handle `PythonObject` just fine, but sometimes you'll need to explicitly convert a Python value into a native Mojo value. Currently `PythonObject` conforms to the [`Stringable`](/mojo/stdlib/builtin/str/Stringable), [`Boolable`](/mojo/stdlib/builtin/bool/Boolable), [`Intable`](/mojo/stdlib/builtin/int/Intable), and [`Floatable`](/mojo/stdlib/builtin/floatable/Floatable/) traits. This allows you to convert a `PythonObject` to the corresponding Mojo types.

```mojo
var s = String(py_string)
var b = Bool(py_bool)
var i = Int(py_int)
var f = Float64(py_float)
```

`PythonObject` also implements the [`Writable`](/mojo/stdlib/utils/write/Writable) trait, so that you can print Python values using the built-in [`print()`](/mojo/stdlib/builtin/io/print) function.

```mojo
print(python_object)
```

### Comparing Python types in Mojo

You can use Python objects in Mojo comparison expressions, and the Mojo `is` operator also works to compare the identity of two Python objects. Python values like `False` and `None` evaluate as false in Mojo boolean expressions as well.

If you need to know the type of the underlying Python object, you can use the [`Python.type()`](/mojo/stdlib/python/python/Python#type) method, which is equivalent to the Python `type()` builtin.
You can test if a Python object is of a particular type by performing an identity comparison against the type as shown below:

```mojo
from python import Python
from python import PythonObject

def main():
    var value1: PythonObject = 3.7
    value2 = Python.evaluate("10/3")

    # Compare values
    print("Is value1 greater than 3:", value1 > 3)
    print("Is value1 greater than value2:", value1 > value2)

    # Compare identities
    value3 = value2
    print("value1 is value2:", value1 is value2)
    print("value2 is value3:", value2 is value3)

    # Compare types
    py_float_type = Python.evaluate("float")
    print("Python float type:", py_float_type)
    print("value1 type:", Python.type(value1))
    print("Is value1 a Python float:", Python.type(value1) is py_float_type)
```

```output
Is value1 greater than 3: True
Is value1 greater than value2: True
value1 is value2: False
value2 is value3: True
Python float type: <class 'float'>
value1 type: <class 'float'>
Is value1 a Python float: True
```

---

## python_object

Implements PythonObject. You can import these APIs from the `python` package. For example:

```mojo
from python import PythonObject
```

## Aliases

### `PyFunction`

`alias PyFunction = fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) -> PythonObject`

### `PyFunctionRaising`

`alias PyFunctionRaising = fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) raises -> PythonObject`

### `PythonModule`

`alias PythonModule = TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")]`

## Structs

* [`PythonObject`](/mojo/stdlib/python/python_object/PythonObject): A Python object.
* [`TypedPythonObject`](/mojo/stdlib/python/python_object/TypedPythonObject): A wrapper around `PythonObject` that indicates the type of the contained object.

## Traits

* [`ConvertibleFromPython`](/mojo/stdlib/python/python_object/ConvertibleFromPython): Denotes a type that can attempt construction from a read-only Python object.
* [`PythonConvertible`](/mojo/stdlib/python/python_object/PythonConvertible): A trait that indicates a type can be converted to a PythonObject, and that specifies the behavior with a `to_python_object` method.

---

## PythonConvertible

A trait that indicates a type can be converted to a PythonObject, and that specifies the behavior with a `to_python_object` method.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `to_python_object`

`to_python_object(self: _Self) -> PythonObject`

Convert a value to a PythonObject.

**Returns:** A PythonObject representing the value.

---

## PythonModuleBuilder

`struct PythonModuleBuilder`

A builder for creating Python modules with Mojo function and type bindings. This builder provides a high-level API for declaring Python bindings for Mojo functions and types within a Python module. It manages the registration of functions, types, and their associated metadata, then finalizes everything into a complete Python module object.

The builder follows a declarative pattern where you:

1. Create a builder instance with a module name.
2. Add function bindings using `def_function()`, `def_py_function()`, `def_py_c_function()`.
3. Add type bindings using `add_type[T]()` and configure them.
4. Call `finalize()` to finish building the Python module.
Example:

```mojo
from python.bindings import PythonModuleBuilder

var builder = PythonModuleBuilder("my_module")
builder.def_function[my_func]("my_func", "Documentation for my_func")
_ = builder.add_type[MyType]("MyType").def_method[my_method]("my_method")
var module = builder.finalize()
```

Note: After calling `finalize()`, the builder's internal state is cleared and it should not be reused for creating additional modules.

## Fields

* module (`TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")]`): The Python module being built.
* functions (`List[PyMethodDef]`): List of function definitions that will be exposed in the module.
* type\_builders (`List[PythonTypeBuilder]`): List of type builders for types that will be exposed in the module.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, name: StringSlice[StaticConstantOrigin])`

Construct a Python module builder with the given module name.

**Args:**

* name (`StringSlice[StaticConstantOrigin]`): The name of the module.

**Raises:** If the module creation fails.

`__init__(out self, module: TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")])`

Construct a Python module builder with the given module.

**Args:**

* module (`TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")]`): The module to build.

### `add_type`

`add_type[T: Movable & Defaultable & Representable & TypeIdentifiable](mut self, type_name: StringSlice[StaticConstantOrigin]) -> ref [*[0,0].type_builders] PythonTypeBuilder`

Add a type to the module and return a builder for it.

**Parameters:**

* T (`Movable & Defaultable & Representable & TypeIdentifiable`): The Mojo type to bind in the module.

**Args:**

* type\_name (`StringSlice[StaticConstantOrigin]`): The name of the type to expose in the module.

**Returns:** A reference to a type builder registered in the module builder.

### `def_py_c_function`

`def_py_c_function(mut self, func: fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())`

Declare a binding for a function with PyCFunction signature in the module.

**Args:**

* func (`fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr`): The function to declare a binding for.
* func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module.
* docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module.

### `def_py_function`

`def_py_function[func: fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())`

Declare a binding for a function with PyFunction signature in the module.

**Parameters:**

* func (`fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) -> PythonObject`): The function to declare a binding for.

**Args:**

* func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module.
* docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module.
`def_py_function[func: fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) raises -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PyFunctionRaising signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) raises -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. ### `def_function` `def_function[func: fn() raises -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn() raises -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject) raises -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject) raises -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject, mut PythonObject) raises -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject) raises -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject, mut PythonObject, mut PythonObject) raises -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject, mut PythonObject) raises -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn() -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn() -> PythonObject`): The function to declare a binding for. 
**Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject) -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject) -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject, mut PythonObject) -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject) -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject, mut PythonObject, mut PythonObject) -> PythonObject](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject, mut PythonObject) -> PythonObject`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn() raises -> None](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn() raises -> None`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject) raises -> None](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject) raises -> None`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject, mut PythonObject) raises -> None](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. 
**Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject) raises -> None`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject, mut PythonObject, mut PythonObject) raises -> None](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject, mut PythonObject) raises -> None`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn() -> None](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn() -> None`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject) -> None](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject) -> None`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject, mut PythonObject) -> None](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject) -> None`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. `def_function[func: fn(mut PythonObject, mut PythonObject, mut PythonObject) -> None](mut self, func_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice())` Declare a binding for a function with PythonObject signature in the module. **Parameters:** * ​func (`fn(mut PythonObject, mut PythonObject, mut PythonObject) -> None`): The function to declare a binding for. **Args:** * ​func\_name (`StringSlice[StaticConstantOrigin]`): The name with which the function will be exposed in the module. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the function in the module. ### `finalize` `finalize(mut self) -> TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")]` Finalize the module builder, creating the module object. 
All types and functions added to the builder will be built and exposed in the module. After calling this method, the builder's internal state is cleared and it should not be reused for creating additional modules.

**Returns:** The finalized Python module containing all registered functions and types.

**Raises:** If the module creation fails or if we fail to add any of the declared functions or types to the module.

---

## PythonObject

`@register_passable`
`struct PythonObject`

A Python object.

## Fields

* py\_object (`PyObjectPtr`): A pointer to the underlying Python object.

## Implemented traits

`AnyType`, `Boolable`, `Copyable`, `Movable`, `PythonConvertible`, `SizedRaising`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__() -> Self`

Initialize the object with a `None` value.

`__init__(*, from_owned_ptr: PyObjectPtr) -> Self`

Initialize this object from an owned reference-counted Python object pointer. Ownership of the reference will be assumed by `PythonObject`.

**Args:**

* from\_owned\_ptr (`PyObjectPtr`): The `PyObjectPtr` to take ownership of.

`__init__(*, from_borrowed_ptr: PyObjectPtr) -> Self`

Initialize this object from a read-only reference-counted Python object pointer. The reference count of the pointee object will be incremented, and ownership of the additional reference count will be assumed by the initialized `PythonObject`.

The CPython API documentation indicates the ownership semantics of the returned object on any function that returns a `PyObject*` value. The two possible annotations are:

* "Return value: New reference."
* "Return value: Borrowed reference."

This function should be used to construct a `PythonObject` from the pointer returned by 'Borrowed reference'-type functions.

**Args:**

* from\_borrowed\_ptr (`PyObjectPtr`): A read-only reference counted pointer to a Python object.

**Returns:** An owned PythonObject pointer.

`__init__[T: Movable & TypeIdentifiable](out self, *, owned alloc: T)`

Allocate a new `PythonObject` and store a Mojo value in it. The newly allocated Python object will contain the provided Mojo `T` instance directly, without attempting conversion to an equivalent Python builtin type. Only Mojo types that have a registered Python 'type' object can be stored as a Python object. Mojo types are registered using a `PythonTypeBuilder`.

**Parameters:**

* T (`Movable & TypeIdentifiable`): The Mojo type of the value that the resulting Python object holds.

**Args:**

* alloc (`T`): The Mojo value to store in the new Python object.

**Raises:** If no Python type object has been registered for `T` by a `PythonTypeBuilder`.

`@implicit`
`__init__(owned typed_obj: TypedPythonObject[type_hint]) -> Self`

Construct a PythonObject from a typed object, dropping the type hint information. This is a no-op at runtime. The only information that is lost is static type information.

**Args:**

* typed\_obj (`TypedPythonObject[type_hint]`): The typed python object to unwrap.

`@implicit`
`__init__(none: NoneType) -> Self`

Initialize a none value object from a `None` literal.

**Args:**

* none (`NoneType`): None.

`@implicit`
`__init__(value: Bool) -> Self`

Initialize the object from a bool.

**Args:**

* value (`Bool`): The boolean value.

`@implicit`
`__init__(integer: Int) -> Self`

Initialize the object with an integer value.

**Args:**

* integer (`Int`): The integer value.

`@implicit`
`__init__[dtype: DType](value: SIMD[dtype, 1]) -> Self`

Initialize the object with a generic scalar value.
If the scalar value type is bool, it is converted to a boolean. Otherwise, it is converted to the appropriate integer or floating point type.

**Parameters:**

* dtype (`DType`): The scalar value type.

**Args:**

* value (`SIMD[dtype, 1]`): The scalar value.

`@implicit`
`__init__(value: StringLiteral[value]) -> Self`

Initialize the object from a string literal.

**Args:**

* value (`StringLiteral[value]`): The string value.

`@implicit`
`__init__(value: String) -> Self`

Initialize the object from a string.

**Args:**

* value (`String`): The string value.

`@implicit`
`__init__(string: StringSlice[origin]) -> Self`

Initialize the object from a string.

**Args:**

* string (`StringSlice[origin]`): The string value.

`@implicit`
`__init__(slice: Slice) -> Self`

Initialize the object from a Mojo Slice.

**Args:**

* slice (`Slice`): The slice value.

`__init__[*Ts: PythonConvertible](owned *values: *Ts, *, __list_literal__: Tuple[]) -> Self`

Construct a Python list of objects.

**Parameters:**

* \*Ts (`PythonConvertible`): The types of the input values.

**Args:**

* \*values (`*Ts`): The values to initialize the list with.
* `__list_literal__` (`Tuple[]`): Tell Mojo to use this method for list literals.

**Returns:** The constructed Python list.

`__init__[*Ts: PythonConvertible](out self, owned *values: *Ts, *, __set_literal__: Tuple[])`

Construct a Python set of objects.

**Parameters:**

* \*Ts (`PythonConvertible`): The types of the input values.

**Args:**

* \*values (`*Ts`): The values to initialize the set with.
* `__set_literal__` (`Tuple[]`): Tell Mojo to use this method for set literals.

**Returns:** The constructed Python set.

`__init__(out self, owned keys: List[PythonObject], owned values: List[PythonObject], __dict_literal__: Tuple[])`

Construct a Python dictionary from a list of keys and a list of values.

**Args:**

* keys (`List[PythonObject]`): The keys of the dictionary.
* values (`List[PythonObject]`): The values of the dictionary.
* `__dict_literal__` (`Tuple[]`): Tell Mojo to use this method for dict literals.

### `__copyinit__`

`__copyinit__(existing: Self) -> Self`

Copy the object. This increments the underlying refcount of the existing object.

**Args:**

* existing (`Self`): The value to copy.

### `__del__`

`__del__(owned self)`

Destroy the object. This decrements the underlying refcount of the pointed-to object.

### `__bool__`

`__bool__(self) -> Bool`

Evaluate the boolean value of the object.

**Returns:** Whether the object evaluates as true.

### `__getitem__`

`__getitem__(self, *args: Self) -> Self`

Return the value for the given key or keys.

**Args:**

* \*args (`Self`): The key or keys to access on this object.

**Returns:** The value corresponding to the given key for this object.

`__getitem__(self, *args: Slice) -> Self`

Return the sliced value for the given Slice or Slices.

**Args:**

* \*args (`Slice`): The Slice or Slices to apply to this object.

**Returns:** The sliced value corresponding to the given Slice(s) for this object.

### `__setitem__`

`__setitem__(self, *args: Self, *, value: Self)`

Set the value with the given key or keys.

**Args:**

* \*args (`Self`): The key or keys to set on this object.
* value (`Self`): The value to set.

### `__neg__`

`__neg__(self) -> Self`

Negative. Calls the underlying object's `__neg__` method.

**Returns:** The result of prefixing this object with a `-` operator. For most numerical objects, this returns the negative.

### `__pos__`

`__pos__(self) -> Self`

Positive. Calls the underlying object's `__pos__` method.
**Returns:** The result of prefixing this object with a `+` operator. For most numerical objects, this does nothing. ### `__invert__` `__invert__(self) -> Self` Inversion. Calls the underlying object's `__invert__` method. **Returns:** The logical inverse of this object: a bitwise representation where all bits are flipped, from zero to one, and from one to zero. ### `__lt__` `__lt__(self, rhs: Self) -> Self` Less than (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__lt__` method, or if it fails. ### `__le__` `__le__(self, rhs: Self) -> Self` Less than or equal (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__le__` method, or if it fails. ### `__eq__` `__eq__(self, rhs: Self) -> Self` Equality (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__eq__` method, or if it fails. ### `__ne__` `__ne__(self, rhs: Self) -> Self` Inequality (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__ne__` method, or if it fails. ### `__gt__` `__gt__(self, rhs: Self) -> Self` Greater than (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__gt__` method, or if it fails. ### `__ge__` `__ge__(self, rhs: Self) -> Self` Greater than or equal (rich) comparison operator. **Args:** * ​rhs (`Self`): The value of the right hand side of the comparison. **Returns:** The result of the comparison, not necessarily a boolean. **Raises:** If the object doesn't implement the `__ge__` method, or if it fails. ### `__is__` `__is__(self, other: Self) -> Bool` Test if the PythonObject is the `other` PythonObject, the same as `x is y` in Python. **Args:** * ​other (`Self`): The right-hand-side value in the comparison. **Returns:** True if they are the same object and False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Test if the PythonObject is not the `other` PythonObject, the same as `x is not y` in Python. **Args:** * ​other (`Self`): The right-hand-side value in the comparison. **Returns:** True if they are not the same object and False otherwise. ### `__contains__` `__contains__(self, rhs: Self) -> Bool` Contains dunder. Calls the underlying object's `__contains__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** True if rhs is in self. ### `__add__` `__add__(self, rhs: Self) -> Self` Addition and concatenation. Calls the underlying object's `__add__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The sum or concatenated values. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Subtraction. Calls the underlying object's `__sub__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The difference. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Multiplication. 
Calls the underlying object's `__mul__` method. **Args:** * ​rhs (`Self`): Right hand value. **Returns:** The product. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Division. Calls the underlying object's `__truediv__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is divided. **Returns:** The result of dividing this object by the right-hand-side value. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Return the division of self and rhs rounded down to the nearest integer. Calls the underlying object's `__floordiv__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is divided. **Returns:** The result of dividing this by the right-hand-side value, rounded down to the nearest integer. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Return the remainder of self divided by rhs. Calls the underlying object's `__mod__` method. **Args:** * ​rhs (`Self`): The divisor. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Self) -> Self` Raises this object to the power of the given value. **Args:** * ​exp (`Self`): The exponent. **Returns:** The result of raising this by the given exponent. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Bitwise left shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the left. **Returns:** This value, shifted left by the given value. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Bitwise right shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the right. **Returns:** This value, shifted right by the given value. ### `__and__` `__and__(self, rhs: Self) -> Self` Bitwise AND. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise AND'ed. **Returns:** The bitwise AND result of this and the given value. ### `__or__` `__or__(self, rhs: Self) -> Self` Bitwise OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise OR'ed. **Returns:** The bitwise OR result of this and the given value. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Exclusive OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is exclusive OR'ed. **Returns:** The exclusive OR result of this and the given value. ### `__radd__` `__radd__(self, lhs: Self) -> Self` Reverse addition and concatenation. Calls the underlying object's `__radd__` method. **Args:** * ​lhs (`Self`): The left-hand-side value to which this object is added or concatenated. **Returns:** The sum. ### `__rsub__` `__rsub__(self, lhs: Self) -> Self` Reverse subtraction. Calls the underlying object's `__rsub__` method. **Args:** * ​lhs (`Self`): The left-hand-side value from which this object is subtracted. **Returns:** The result of subtracting this from the given value. ### `__rmul__` `__rmul__(self, lhs: Self) -> Self` Reverse multiplication. Calls the underlying object's `__rmul__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is multiplied by this object. **Returns:** The product of the multiplication. ### `__rtruediv__` `__rtruediv__(self, lhs: Self) -> Self` Reverse division. Calls the underlying object's `__rtruediv__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The result of dividing the given value by this. ### `__rfloordiv__` `__rfloordiv__(self, lhs: Self) -> Self` Reverse floor division.
Calls the underlying object's `__rfloordiv__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The result of dividing the given value by this, rounded down to the nearest integer. ### `__rmod__` `__rmod__(self, lhs: Self) -> Self` Reverse modulo. Calls the underlying object's `__rmod__` method. **Args:** * ​lhs (`Self`): The left-hand-side value that is divided by this object. **Returns:** The remainder from dividing the given value by this. ### `__rpow__` `__rpow__(self, lhs: Self) -> Self` Reverse exponentiation. **Args:** * ​lhs (`Self`): The number that is raised to the power of this object. **Returns:** The result of raising the given value by this exponent. ### `__rlshift__` `__rlshift__(self, lhs: Self) -> Self` Reverse bitwise left shift. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise shifted to the left by this object. **Returns:** The given value, shifted left by this. ### `__rrshift__` `__rrshift__(self, lhs: Self) -> Self` Reverse bitwise right shift. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise shifted to the right by this object. **Returns:** The given value, shifted right by this. ### `__rand__` `__rand__(self, lhs: Self) -> Self` Reverse bitwise AND. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise AND'ed with this object. **Returns:** The bitwise AND result of the given value and this. ### `__ror__` `__ror__(self, lhs: Self) -> Self` Reverse bitwise OR. **Args:** * ​lhs (`Self`): The left-hand-side value that is bitwise OR'ed with this object. **Returns:** The bitwise OR result of the given value and this. ### `__rxor__` `__rxor__(self, lhs: Self) -> Self` Reverse exclusive OR. **Args:** * ​lhs (`Self`): The left-hand-side value that is exclusive OR'ed with this object. **Returns:** The exclusive OR result of the given value and this. ### `__iadd__` `__iadd__(mut self, rhs: Self)` In-place addition and concatenation. **Args:** * ​rhs (`Self`): The right-hand-side value that is added to this object. ### `__isub__` `__isub__(mut self, rhs: Self)` In-place subtraction. **Args:** * ​rhs (`Self`): The right-hand-side value that is subtracted from this object. ### `__imul__` `__imul__(mut self, rhs: Self)` In-place multiplication. Calls the underlying object's `__imul__` method. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is multiplied. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` In-place division. **Args:** * ​rhs (`Self`): The value by which this object is divided. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` In-place floor division. **Args:** * ​rhs (`Self`): The value by which this object is divided. ### `__imod__` `__imod__(mut self, rhs: Self)` In-place modulo. **Args:** * ​rhs (`Self`): The right-hand-side value that is used to divide this object. ### `__ipow__` `__ipow__(mut self, rhs: Self)` In-place exponentiation. **Args:** * ​rhs (`Self`): The exponent. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` In-place bitwise left shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the left. ### `__irshift__` `__irshift__(mut self, rhs: Self)` In-place bitwise right shift. **Args:** * ​rhs (`Self`): The right-hand-side value by which this object is bitwise shifted to the right. ### `__iand__` `__iand__(mut self, rhs: Self)` In-place bitwise AND. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise AND'ed.
### `__ixor__` `__ixor__(mut self, rhs: Self)` In-place exclusive OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is exclusive OR'ed. ### `__ior__` `__ior__(mut self, rhs: Self)` In-place bitwise OR. **Args:** * ​rhs (`Self`): The right-hand-side value with which this object is bitwise OR'ed. ### `copy` `copy(self) -> Self` Copy the object. **Returns:** A copy of the value. ### `__iter__` `__iter__(self) -> _PyIter` Iterate over the object. **Returns:** An iterator object. **Raises:** If the object is not iterable. ### `__getattr__` `__getattr__(self, owned name: String) -> Self` Return the value of the object attribute with the given name. **Args:** * ​name (`String`): The name of the object attribute to return. **Returns:** The value of the object attribute with the given name. ### `__setattr__` `__setattr__(self, owned name: String, new_value: Self)` Set the given value for the object attribute with the given name. **Args:** * ​name (`String`): The name of the object attribute to set. * ​new\_value (`Self`): The new value to be set for that attribute. ### `__call__` `__call__(self, *args: Self, *, owned **kwargs: Self) -> Self` Call the underlying object as if it were a function. **Args:** * ​\*args (`Self`): Positional arguments to the function. * ​\*\*kwargs (`Self`): Keyword arguments to the function. **Returns:** The return value from the called object. **Raises:** If the function cannot be called for any reason. ### `__len__` `__len__(self) -> Int` Returns the length of the object. **Returns:** The length of the object. ### `__hash__` `__hash__(self) -> Int` Returns the hash value of the object. **Returns:** The hash value of the object. ### `__int__` `__int__(self) -> Self` Convert the PythonObject to a Python `int` (i.e. arbitrary precision integer). **Returns:** A Python `int` object. **Raises:** An error if the conversion failed. ### `__float__` `__float__(self) -> Self` Convert the PythonObject to a Python `float` object. **Returns:** A Python `float` object. **Raises:** If the conversion fails. ### `__str__` `__str__(self) -> Self` Convert the PythonObject to a Python `str`. **Returns:** A Python `str` object. **Raises:** An error if the conversion failed. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this Python object to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `to_python_object` `to_python_object(self) -> Self` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `unsafe_as_py_object_ptr` `unsafe_as_py_object_ptr(self) -> PyObjectPtr` Get the underlying PyObject pointer. Safety: Use-after-free: The caller must take care that `self` outlives the usage of the pointer returned by this function. **Returns:** The underlying PyObject pointer. ### `steal_data` `steal_data(owned self) -> PyObjectPtr` Take ownership of the underlying pointer from the Python object. **Returns:** The underlying data. ### `unsafe_get_as_pointer` `unsafe_get_as_pointer[dtype: DType](self) -> UnsafePointer[SIMD[dtype, 1]]` Reinterpret a Python integer as a Mojo pointer. Warning: converting from an integer to a pointer is unsafe! The compiler assumes the resulting pointer DOES NOT alias any Mojo-derived pointer. This is OK if the pointer originates from and is owned by Python, e.g. the data underpinning a torch tensor. **Parameters:** * ​dtype (`DType`): The desired DType of the pointer.
**Returns:** An `UnsafePointer` for the underlying Python data. ### `downcast_value_ptr` `downcast_value_ptr[T: TypeIdentifiable](self, *, func: Optional[StringSlice[StaticConstantOrigin]] = Optional(None)) -> UnsafePointer[T]` Get a pointer to the expected contained Mojo value of type `T`. This method validates that this object actually contains an instance of `T`, and will raise an error if it does not. Mojo values are stored as Python objects backed by the `PyMojoObject[T]` struct. **Parameters:** * ​T (`TypeIdentifiable`): The type of the Mojo value that this Python object is expected to contain. **Args:** * ​func (`Optional[StringSlice[StaticConstantOrigin]]`): Optional name of bound Mojo function that the raised TypeError should reference if downcasting fails. **Returns:** A pointer to the inner Mojo value. **Raises:** If the Python object does not contain an instance of the Mojo `T` type. ### `unchecked_downcast_value_ptr` `unchecked_downcast_value_ptr[T: AnyType](self) -> UnsafePointer[T]` Get a pointer to the expected Mojo value of type `T`. This function assumes that this Python object was allocated as an instance of `PyMojoObject[T]`. Safety: The user must be certain that this Python object type matches the bound Python type object for `T`. **Parameters:** * ​T (`AnyType`): The type of the Mojo value stored in this object. **Returns:** A pointer to the inner Mojo value. --- ## PythonTypeBuilder `struct PythonTypeBuilder` A builder for a Python 'type' binding. This is typically used to build a type description of a `PyMojoObject[T]`. This builder is used to declare method bindings for a Python type, and then create the type binding. Finalizing a builder created with `PythonTypeBuilder.bind[T]()` will globally register the resulting Python 'type' object as the single canonical type object for the Mojo type `T`. Subsequent attempts to register a Python type for `T` will raise an exception. Registering a Python type object for `T` is necessary to be able to construct a `PythonObject` from an instance of `T`, or to downcast an existing `PythonObject` to a pointer to the inner `T` value. ## Fields * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. * ​basicsize (`Int`): The required allocation size to hold an instance of this type as a Python object. * ​methods (`List[PyMethodDef]`): List of method definitions that will be exposed on the Python type. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, type_name: StringSlice[StaticConstantOrigin], *, basicsize: Int)` Construct a new builder for a Python type binding. **Args:** * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. * ​basicsize (`Int`): The required allocation size to hold an instance of this type as a Python object. ### `bind` `static bind[T: Movable & Defaultable & Representable & TypeIdentifiable](type_name: StringSlice[StaticConstantOrigin]) -> Self` Construct a new builder for a Python type that binds a Mojo type. **Parameters:** * ​T (`Movable & Defaultable & Representable & TypeIdentifiable`): The Mojo type to bind. **Args:** * ​type\_name (`StringSlice[StaticConstantOrigin]`): The name the type will be exposed as in the Python module. **Returns:** A new type builder instance.
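As a minimal sketch of how `bind()`, the `def_method()` overloads, and `finalize()` (documented below) fit together: the `Counter` struct (not defined here), the `counter_greet` function, and the import paths are hypothetical assumptions; the sketch assumes `Counter` conforms to `Movable & Defaultable & Representable & TypeIdentifiable` and that a `TypedPythonObject["Module"]` is available, for example from a module builder.

```mojo
from python import PythonObject, TypedPythonObject  # assumed import path
from python.bindings import PythonTypeBuilder  # assumed import path

fn counter_greet(mut self_obj: PythonObject) raises -> PythonObject:
    # Hypothetical bound method: receives the Python `self` as a PythonObject.
    return PythonObject("hello from Counter")

fn register_counter(module: TypedPythonObject["Module"]) raises:
    # Bind the hypothetical Mojo type `Counter` under the Python name "Counter".
    var builder = PythonTypeBuilder.bind[Counter]("Counter")
    # Expose one method; `def_method` returns the builder, so calls can chain.
    _ = builder.def_method[counter_greet]("greet", "Return a greeting.")
    # Create the type object and add it to the module in one step.
    builder.finalize(module)
```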
### `finalize` `finalize(mut self) -> TypedPythonObject[__init__[__mlir_type.!kgen.string]("Type")]` Finalize the builder and create a Python type object. This method completes the construction of a Python type object from the builder's configuration. The method ensures that each Mojo type has exactly one corresponding Python type object by registering the created type in a global registry. This prevents accidental creation of multiple type objects for the same Mojo type, which would break Python's type system assumptions. Note: After calling this method, the builder's internal state may be modified (methods list is consumed), so the builder should not be reused for creating additional type objects. TODO: This should be enforced programmatically in the future. **Returns:** A `TypedPythonObject["Type"]` representing the newly created Python type object that can be used to create instances or register with Python modules. **Raises:** If the Python type object creation fails, typically due to invalid type specifications or Python C API errors. `finalize(mut self, module: TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")])` Finalize the builder and add the created type to a Python module. This method completes the type building process by calling the parameterless `finalize()` method to create the Python type object, then automatically adds the resulting type to the specified Python module using the builder's configured type name. After successful completion, the builder's method list is cleared to prevent accidental reuse. This is a convenience method that combines type finalization and module registration in a single operation, which is the most common use case when creating Python-accessible Mojo types. Note: After calling this method, the builder's internal state is modified (methods list is cleared), so the builder should not be reused for creating additional type objects. If you need the type object for further operations, use the parameterless `finalize()` method instead and manually add it to the module. **Args:** * ​module (`TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")]`): The Python module to which the finalized type will be added. The type will be accessible from Python code that imports this module using the name specified during builder construction. **Raises:** If the type object creation fails (see `finalize()` for details) or if adding the type to the module fails, typically due to name conflicts or module state issues. ### `def_py_c_method` `def_py_c_method(mut self, method: fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObjectPtr signature for the type. **Args:** * ​method (`fn(PyObjectPtr, PyObjectPtr) -> PyObjectPtr`): The method to declare a binding for. * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. ### `def_py_method` `def_py_method[method: fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObject signature for the type. 
**Parameters:** * ​method (`fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_py_method[method: fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) raises -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PyObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")]) raises -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. ### `def_method` `def_method[method: fn(mut PythonObject) raises -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject) raises -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject, mut PythonObject) raises -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject) raises -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject, mut PythonObject, mut PythonObject) raises -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject, mut PythonObject) raises -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. 
`def_method[method: fn(mut PythonObject) -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject) -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject, mut PythonObject) -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject) -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject, mut PythonObject, mut PythonObject) -> PythonObject](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject, mut PythonObject) -> PythonObject`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject) raises -> None](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject) raises -> None`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject, mut PythonObject) raises -> None](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject) raises -> None`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. 
`def_method[method: fn(mut PythonObject, mut PythonObject, mut PythonObject) raises -> None](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject, mut PythonObject) raises -> None`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject) -> None](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject) -> None`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject, mut PythonObject) -> None](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject) -> None`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. `def_method[method: fn(mut PythonObject, mut PythonObject, mut PythonObject) -> None](mut self, method_name: StringSlice[StaticConstantOrigin], docstring: StringSlice[StaticConstantOrigin] = StringSlice()) -> ref [*[0,0]] Self` Declare a binding for a method with PythonObject signature for the type. **Parameters:** * ​method (`fn(mut PythonObject, mut PythonObject, mut PythonObject) -> None`): The method to declare a binding for. **Args:** * ​method\_name (`StringSlice[StaticConstantOrigin]`): The name with which the method will be exposed on the type. * ​docstring (`StringSlice[StaticConstantOrigin]`): The docstring for the method of the type. **Returns:** The builder with the method binding declared. --- ## q_smem_usage `q_smem_usage[: DType, : DType, : DType, : Bool, : IndexList[3], //, config: MatmulConfig[$0, $1, $2, $3, $4], group_size: Int]() -> Int` --- ## q4_k_dequantize_impl `q4_k_dequantize_impl(input_tensor: NDBuffer[uint8, 2, origin], output_tensor: NDBuffer[float32, 2, origin])` --- ## Q4sym `struct Q4sym[group_size: Int, float_dtype: DType = float32]` Q4sym: compresses values of type `float_dtype` to 4-bit unsigned integers which have been dynamically symmetrically quantized with the given scale factor. `group_size` determines the number of elements which share quantization parameters. We store things in a strided fashion: Example: Assume `group_size = 8` and we want to process uint4 numbers: A, B, C, D, E, F, G, H which have associated bits aaaa, bbbb, cccc, ....
eeeeaaaa|ffffbbbb|ggggcccc|hhhhdddd To uncompress to floating point, take the decoded uint4 value, subtract the implicit zero-point of 8 (the midpoint of the 2^4 = 16 value range), and multiply by the scale factor. ## Parameters * ​group\_size (`Int`): The number of encoded numbers stored in this struct. * ​float\_dtype (`DType`): The floating point dtype this struct works with. ## Fields * ​scale (`StaticTuple[SIMD[uint8, 1], 2]`): The FP16 scale of the group, stored as individual bytes. * ​bits (`StaticTuple[SIMD[uint8, 1], group_size // 2]`): The bits of the encoded uint4 numbers, packed two per byte. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Construct a default-initialized Q4sym. `@implicit` `__init__(out self, data: SIMD[float_dtype, group_size])` Construct an encoded Q4sym from data. **Args:** * ​data (`SIMD[float_dtype, group_size]`): The floating point data to encode and store. ### `decode_scale` `decode_scale(mut self) -> SIMD[float16, 1]` Obtain the scale factor. **Returns:** The decoded scale factor. ### `decode_unsigned` `decode_unsigned(mut self) -> SIMD[uint8, group_size]` Decode the stored uint4 numbers to uint8. **Returns:** The decoded stored numbers as uint8 numbers. These have an implicit zero-point of 8. ### `decode_signed` `decode_signed(mut self) -> SIMD[int8, group_size]` Decode the stored uint4 numbers to requantized int4 numbers. This is done by simply subtracting an implicit zero-point of 8 from the unsigned decoding. **Returns:** The decoded stored numbers as int8 numbers. These have a zero-point of 0. ### `decode_fully` `decode_fully(mut self) -> SIMD[float_dtype, group_size]` Decode the stored numbers into floating point representation. **Returns:** The decoded numbers. ### `quantize_and_write_to_tensor` `static quantize_and_write_to_tensor[rank: Int](input_tensor: NDBuffer[float_dtype, rank, origin], output_tensor: NDBuffer[uint8, rank, origin], input_shape: IndexList[rank])` Encodes the floating point numbers in `input_tensor` along the inner-most dimension and writes the result to output\_tensor. **Parameters:** * ​rank (`Int`): The rank of the input and output tensors. **Args:** * ​input\_tensor (`NDBuffer[float_dtype, rank, origin]`): The input tensor we are encoding. * ​output\_tensor (`NDBuffer[uint8, rank, origin]`): The output tensor containing the encoded input. The shape of the output should be the same as the input except along the inner dimension where if the original inner dimension was `d`, the corresponding output dimension should be: ceil(`d` / group\_size) \* sizeof(self). * ​input\_shape (`IndexList[rank]`): The shape of the input tensor. ### `dequantize_and_write_to_tensor` `static dequantize_and_write_to_tensor[rank: Int, //](input_tensor: NDBuffer[uint8, rank, origin], output_tensor: NDBuffer[float_dtype, rank, origin], output_shape: IndexList[rank])` Decodes the quantized numbers in `input_tensor` along the inner-most dimension and writes the result to output\_tensor. **Parameters:** * ​rank (`Int`): The rank of the input and output tensors. **Args:** * ​input\_tensor (`NDBuffer[uint8, rank, origin]`): The input tensor we are decoding. * ​output\_tensor (`NDBuffer[float_dtype, rank, origin]`): The output tensor containing the decoded input. * ​output\_shape (`IndexList[rank]`): The shape of the output tensor.
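To make the encode/decode flow concrete, here is a minimal round-trip sketch for one group of eight `float32` values. The import path is an assumption based on the `per_channel_grouped_4bit` module listed under the Mojo `quantization` package; decoded values match the inputs only up to 4-bit precision.

```mojo
from quantization.per_channel_grouped_4bit import Q4sym  # assumed import path

fn main():
    # One group of eight float32 values (group_size = 8).
    var data = SIMD[DType.float32, 8](-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0)

    # Encode: stores an FP16 scale plus eight uint4 values packed two per byte.
    var q = Q4sym[8](data)

    # Decode back to float32; values are approximate due to 4-bit precision.
    var decoded = q.decode_fully()
    print(decoded)
```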
--- ## q6_k_dequantize_impl `q6_k_dequantize_impl(input_tensor: NDBuffer[uint8, 2, origin], output_tensor: NDBuffer[float32, 2, origin], output_shape: IndexList[2])` --- ## qmatmul ## Aliases ### `K_BATCH_SIZE` `alias K_BATCH_SIZE = 512` Defines the batch size of K used to pack A and unpack B weights. ## Functions * [​`matmul_qint4`](./matmul_qint4): * [​`matmul_qint4_pack_b`](./matmul_qint4_pack_b): --- ## qmatmul_gpu ## Functions * [​`args_to_tuple`](./args_to_tuple): * [​`gpu_qint4_repack_GPTQ`](./gpu_qint4_repack_GPTQ): * [​`gpu_qint4_repack_Q4_0`](./gpu_qint4_repack_Q4_0): * [​`matmul_gpu_qint4`](./matmul_gpu_qint4): * [​`matmul_gpu_qint4_impl`](./matmul_gpu_qint4_impl): * [​`multistage_gemm_q`](./multistage_gemm_q): * [​`multistage_mma_q`](./multistage_mma_q): * [​`multistage_qgemm_kernel`](./multistage_qgemm_kernel): * [​`pack_Q_tile`](./pack_Q_tile): * [​`q_smem_usage`](./q_smem_usage): * [​`repack_GPTQ_for_sm8x`](./repack_GPTQ_for_sm8x): * [​`repack_Q4_0_for_sm8x`](./repack_Q4_0_for_sm8x): * [​`unpack_4bit_int`](./unpack_4bit_int): --- ## qmatmul_k ## Functions * [​`matmul_Q4_K`](./matmul_Q4_K): * [​`matmul_Q4_K_pack_b`](./matmul_Q4_K_pack_b): * [​`matmul_Q6_K`](./matmul_Q6_K): * [​`matmul_Q6_K_pack_b`](./matmul_Q6_K_pack_b): --- ## qr_factorization ## Functions * [​`apply_q`](./apply_q): Applies the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` to the `X` matrix. * [​`form_q`](./form_q): Forms the Q factor from the implicit Q factor stored in `A` and `sigma` after calling `qr_factorization` and stores the result in `Q`. * [​`qr_factorization`](./qr_factorization): Performs QR factorization of a matrix `A` using the Householder reflector method. --- ## qr_factorization `qr_factorization[dtype: DType, element_layout: Layout](sigma: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], A: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Performs QR factorization of a matrix `A` using the Householder reflector method. This function computes the QR factorization of matrix `A` in-place using Householder reflections. The result is stored directly in the input matrix `A`, with scaling factors in `sigma`. The implementation follows the LAPACK algorithm for generating Householder reflectors in-place. Algorithm: The Householder reflector is defined as U = I - σww^H, where w = (x + νe₁)/ξ, σ = ξ/ν, ξ = x₀ + ν, and ν = sign(x₀)‖x‖₂. This ensures that U^H x = -νe₁ and U^H U = I. References: \[1] Lehoucq, R. B. (1996). The computation of elementary unitary matrices. ACM Transactions on Mathematical Software, 22(4), 393-400. Note: There is a typo in reference \[lawn72]. The correct result is U^H x = -νe₁. --- ## quantization APIs to quantize graph tensors. This package includes a comprehensive set of tools for working with quantized models in MAX Graph. It defines supported quantization encodings, configuration parameters that control the quantization process, and block parameter specifications for different quantization formats. The module supports various quantization formats including 4-bit, 5-bit, and 6-bit precision with different encoding schemes. It also provides support for GGUF-compatible formats for interoperability with other frameworks.
## `BlockParameters` {#max.graph.quantization.BlockParameters} > *class* max.graph.quantization.BlockParameters(elements\_per\_block, block\_size) Parameters describing the structure of a quantization block. Block-based quantization stores elements in fixed-size blocks. Each block contains a specific number of elements in a compressed format. **Parameters:** * **elements\_per\_block** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **block\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) ### `block_size` {#max.graph.quantization.BlockParameters.block_size} > block\_size\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `elements_per_block` {#max.graph.quantization.BlockParameters.elements_per_block} > elements\_per\_block\*: [int](https://docs.python.org/3/library/functions.html#int)\* ## `QuantizationConfig` {#max.graph.quantization.QuantizationConfig} > *class* max.graph.quantization.QuantizationConfig(quant\_method, bits, group\_size, desc\_act=False, sym=False) Configuration for specifying quantization parameters that affect inference. These parameters control how tensor values are quantized, including the method, bit precision, grouping, and other characteristics that affect the trade-off between model size, inference speed, and accuracy. **Parameters:** * **quant\_method** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) * **bits** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **group\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **desc\_act** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **sym** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) ### `bits` {#max.graph.quantization.QuantizationConfig.bits} > bits\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `desc_act` {#max.graph.quantization.QuantizationConfig.desc_act} > desc\_act\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* ### `group_size` {#max.graph.quantization.QuantizationConfig.group_size} > group\_size\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `quant_method` {#max.graph.quantization.QuantizationConfig.quant_method} > quant\_method\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* ### `sym` {#max.graph.quantization.QuantizationConfig.sym} > sym\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= False* ## `QuantizationEncoding` {#max.graph.quantization.QuantizationEncoding} > *class* max.graph.quantization.QuantizationEncoding(value, names=None, \*values, module=None, qualname=None, type=None, start=1, boundary=None) Quantization encodings supported by MAX Graph. Each encoding represents a different method of quantizing model weights with specific trade-offs between compression ratio, accuracy, and computational efficiency.
### `GPTQ` {#max.graph.quantization.QuantizationEncoding.GPTQ} > GPTQ *= 'GPTQ'* ### `Q4_0` {#max.graph.quantization.QuantizationEncoding.Q4_0} > Q4\_0 *= 'Q4\_0'* ### `Q4_K` {#max.graph.quantization.QuantizationEncoding.Q4_K} > Q4\_K *= 'Q4\_K'* ### `Q5_K` {#max.graph.quantization.QuantizationEncoding.Q5_K} > Q5\_K *= 'Q5\_K'* ### `Q6_K` {#max.graph.quantization.QuantizationEncoding.Q6_K} > Q6\_K *= 'Q6\_K'* ### `block_parameters` {#max.graph.quantization.QuantizationEncoding.block_parameters} > *property* block\_parameters\*: [BlockParameters](#max.graph.quantization.BlockParameters)\* Gets the block parameters for this quantization encoding. **Returns:** The parameters describing how elements are organized and encoded in blocks for this quantization encoding. **Return type:** [BlockParameters](#max.graph.quantization.BlockParameters) ### `block_size` {#max.graph.quantization.QuantizationEncoding.block_size} > *property* block\_size\*: [int](https://docs.python.org/3/library/functions.html#int)\* Number of bytes in encoded representation of block. All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of bytes resulting after encoding a single block. **Returns:** Size in bytes of each encoded quantization block. **Return type:** [int](https://docs.python.org/3/library/functions.html#int) ### `elements_per_block` {#max.graph.quantization.QuantizationEncoding.elements_per_block} > *property* elements\_per\_block\*: [int](https://docs.python.org/3/library/functions.html#int)\* Number of elements per block. All quantization types currently supported by MAX Graph are block-based: groups of a fixed number of elements are formed, and each group is quantized together into a fixed-size output block. This value is the number of elements gathered into a block. **Returns:** Number of original tensor elements in each quantized block. **Return type:** [int](https://docs.python.org/3/library/functions.html#int) ### `is_gguf` {#max.graph.quantization.QuantizationEncoding.is_gguf} > *property* is\_gguf\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* Checks if this quantization encoding is compatible with GGUF format. GGUF is a format for storing large language models and compatible quantized weights. **Returns:** True if this encoding is compatible with GGUF, False otherwise. **Return type:** [bool](https://docs.python.org/3/library/functions.html#bool) ### `name` {#max.graph.quantization.QuantizationEncoding.name} > *property* name\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* Gets the lowercase name of the quantization encoding. **Returns:** Lowercase string representation of the quantization encoding. **Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str) --- ## quantization This package contains a set of APIs for quantizing tensor data. Quantization is a technique used to reduce the precision of floating-point numbers, which are used in most neural networks. Quantization is a type of lossy compression, which means that some precision is lost, but the resulting tensors take less memory and computations are faster. ## Modules * [​`per_channel_grouped_4bit`](./per_channel_grouped_4bit/): * [​`qmatmul`](./qmatmul/): * [​`qmatmul_gpu`](./qmatmul_gpu/): * [​`qmatmul_k`](./qmatmul_k/): --- ## Quantization MAX allows you to load and run pre-quantized models through both its Python API and CLI. 
This guide explains quantization concepts and how to work with quantized models in your applications. ## Understanding quantization Quantization reduces the numeric precision of model weights to decrease memory usage and increase inference speed. For example, models originally trained with `float32` weights can be represented using lower precision types like `int8` or `int4`, reducing each scalar value from 32 bits to 8 or 4 bits. When used properly, quantization does not significantly affect model accuracy. There are several quantization encodings, each providing a different level of precision and encoding format, with trade-offs that may work well for some models or graph operations ("ops") but not others. Some models also work well with a mixture of quantization types, so that only certain ops perform low-precision calculations while others retain high precision. ## How to load pre-quantized models with MAX You can load pre-quantized models using two primary approaches: - By specifying a path to a quantized weight file - By specifying the quantization encoding format for compatible models When you have a quantized weight file, you can load it directly using the `--weight-path` argument: ```bash max serve --model-path=meta-llama/Llama-3.1-8B-Instruct \ --weight-path=bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf ``` MAX automatically detects the quantization format from the weight file. This approach works for models with standard quantization formats like GGUF and AWQ. For models that have been quantized using specific techniques but don't use a separate weight file format, you can specify the quantization encoding directly with the `--quantization-encoding` flag: ```bash max generate --model-path=hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \ --quantization-encoding=gptq \ --prompt "What is the meaning of life?" ``` The `--quantization-encoding` flag accepts the following values: - `float32`: Full precision 32-bit floating point. - `bfloat16`: Brain floating point 16-bit format. - `q4_0`: 4-bit quantization format. - `q4_k`: 4-bit quantization with K-means clustering. - `q6_k`: 6-bit quantization with K-means clustering. - `gptq`: Specialized quantization optimized for transformer-based models. For more information on the `max` CLI, see the [MAX CLI](/max/max-cli) documentation or the [MAX Serve API reference](/max/api/serve). ## Quantized layer implementation If you're building custom models with the MAX Graph API, you can implement custom quantized layers. This is useful when: - You're building a model from scratch using the MAX Graph API - You need precise control over how quantization is implemented - You're implementing specialized model architectures that require custom quantized operations To implement a quantized layer in Python, you'll need to make a few key changes compared to a standard linear layer. Let's look at the differences.
A standard linear layer in MAX might look like this: ```python from max import nn from max.dtype import DType from max.graph import DeviceRef, Weight class Linear(nn.Module): def __init__(self, in_dim, out_dim): super().__init__() self.weight = Weight( name="weight", dtype=DType.float32, shape=[in_dim, out_dim], device=DeviceRef.CPU(), ) self.bias = Weight(name="bias", dtype=DType.float32, shape=[out_dim]) def __call__(self, x): return x @ self.weight.T.to(x.device) + self.bias.to(x.device) ``` To enable support for GGUF quantization like [`Q4_0`](/max/api/python/graph/quantization#max.graph.quantization.QuantizationEncoding.Q4_0), [`Q4_K`](/max/api/python/graph/quantization#max.graph.quantization.QuantizationEncoding.Q4_K), or other encodings, you need to: 1. Load weights from the quantized model checkpoint as `uint8` with the appropriate shape. 2. Replace the standard matrix multiplication `(@)` with the [`qmatmul`](/max/api/python/graph/ops#max.graph.ops.qmatmul) operation. 3. Specify the quantization encoding to use. Here's how you might implement a quantized linear layer: ```python from max import nn from max.dtype import DType from max.graph import DeviceRef, Weight, ops from max.graph.quantization import QuantizationEncoding class QuantizedLinear(nn.Module): def __init__(self, in_dim, out_dim, quantization_encoding): super().__init__() self.weight = Weight( name="weight", # The DType must be uint8. dtype=DType.uint8, # This shape must be updated to match the quantized shape. shape=[in_dim, out_dim], device=DeviceRef.CPU(), quantization_encoding=quantization_encoding, ) self.bias = Weight(name="bias", dtype=DType.float32, shape=[out_dim]) def __call__(self, x): return ops.qmatmul( self.weight.quantization_encoding, None, x, self.weight.to(x.device) ) + self.bias.to(x.device) quantized_linear = QuantizedLinear(in_dim, out_dim, QuantizationEncoding.Q4_0) ``` The [MAX graph quantization](/max/api/python/graph/quantization) class defines the available quantization formats supported by MAX. These encodings include: - [Q4_0](/max/api/python/graph/quantization#max.graph.quantization.QuantizationEncoding.Q4_0): 4-bit quantization format - [Q4_K](/max/api/python/graph/quantization#max.graph.quantization.QuantizationEncoding.Q4_K): 4-bit quantization with K-means clustering - [Q5_K](/max/api/python/graph/quantization#max.graph.quantization.QuantizationEncoding.Q5_K): 5-bit quantization with K-means clustering - [Q6_K](/max/api/python/graph/quantization#max.graph.quantization.QuantizationEncoding.Q6_K): 6-bit quantization with K-means clustering - [GPTQ](/max/api/python/graph/quantization#max.graph.quantization.QuantizationEncoding.GPTQ): Specialized quantization optimized for transformer-based models With this implementation, you can add quantized weights into your MAX models. The [`qmatmul`](/max/api/python/graph/ops#max.graph.ops.qmatmul) operation handles the dequantization process during inference, giving you the performance benefits of quantization without having to manage the low-level details.
--- ## quantize_dynamic_scaled_fp8 `quantize_dynamic_scaled_fp8[out_dtype: DType, in_dtype: DType, scales_dtype: DType, //, group_size_or_per_token: Int](scaled_output: NDBuffer[out_dtype, 2, origin, shape, strides], scales: NDBuffer[scales_dtype, 2, origin, shape, strides], input: NDBuffer[in_dtype, 2, origin, shape, strides], scale_ub: SIMD[float32, 1], ctx: DeviceContext)` --- ## quantize_fp8_kernel `quantize_fp8_kernel[out_type: DType, scales_type: DType, in_type: DType, warps_per_block: Int, group_size: Int](output: NDBuffer[out_type, 2, MutableAnyOrigin], scales: NDBuffer[scales_type, 2, MutableAnyOrigin], input: NDBuffer[in_type, 2, MutableAnyOrigin], scale_ub: SIMD[scales_type, 1])` --- ## quantize_static_scaled_fp8 `quantize_static_scaled_fp8[out_dtype: DType, in_dtype: DType, is_scale_inverted: Bool = True](out_buffer: NDBuffer[out_dtype, 2, origin, shape, strides], in_buffer: NDBuffer[in_dtype, 2, origin, shape, strides], scale: SIMD[float32, 1], context: DeviceContext)` --- ## QueuedTileScheduler `@register_passable(trivial)` `struct QueuedTileScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], /, decoding: Bool, num_ctas: SIMD[uint32, 1] = SIMD(Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)), schedule: MHASchedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))]` If `decoding == False`, then `num_heads` is `q_num_heads`. If `decoding == True`, then `num_heads` is `kv_num_heads`. ## Fields * ​gidx\_ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(1)]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance = True` ### `mha_schedule` `alias mha_schedule = schedule` ## Methods ### `__init__` `__init__(gidx_ptr: UnsafePointer[SIMD[uint32, 1]]) -> Self` ### `get_current_work_info` `get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` The parameter `func` must return a `Bool` indicating whether the `WorkInfo` arg is valid. This function returns whether the current idx corresponds to a valid `WorkInfo`. Note that if `MHASchedulerSynchronization` is `NONE`, then we assume it is only called by `thread_idx.x==0`. ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `initial_state` `initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## quick_bench ## Structs * [​`QuickBench`](/mojo/stdlib/benchmark/quick_bench/QuickBench): Defines a struct to facilitate benchmarking and avoiding `Bencher` boilerplate. 
--- ## QuickBench `struct QuickBench` Defines a struct to facilitate benchmarking and avoiding `Bencher` boilerplate. ## Fields * ​m (`Bench`): Bench object to collect the results. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize the Bench object. ### `dump_report` `dump_report(mut self)` Prints out the report from a benchmark execution collected in the Bench object. ### `run` `run[T_out: AnyTrivialRegType](mut self, func: fn() -> T_out, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with no input arguments and return type `T_out`. **Parameters:** * ​T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * ​func (`fn() -> T_out`): The function to be benchmarked (run in benchmark iterations). * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of `ThroughputMeasure` values. `run[T0: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0) -> T_out, x0: T0, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 1 input argument and return type `T_out`. **Parameters:** * ​T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * ​T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * ​func (`fn(T0) -> T_out`): The function to be benchmarked (run in benchmark iterations). * ​x0 (`T0`): The 1st argument of func. * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of `ThroughputMeasure` values. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1) -> T_out, x0: T0, x1: T1, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 2 input arguments and return type `T_out`. **Parameters:** * ​T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * ​T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * ​T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * ​func (`fn(T0, T1) -> T_out`): The function to be benchmarked (run in benchmark iterations). * ​x0 (`T0`): The 1st argument of func. * ​x1 (`T1`): The 2nd argument of func. * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of `ThroughputMeasure` values. `run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2) -> T_out, x0: T0, x1: T1, x2: T2, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())` Benchmark function `func` with 3 input arguments and return type `T_out`. **Parameters:** * ​T0 (`AnyTrivialRegType`): Type of the 1st argument of func. * ​T1 (`AnyTrivialRegType`): Type of the 2nd argument of func. * ​T2 (`AnyTrivialRegType`): Type of the 3rd argument of func. * ​T\_out (`AnyTrivialRegType`): Output type of func. **Args:** * ​func (`fn(T0, T1, T2) -> T_out`): The function to be benchmarked (run in benchmark iterations). * ​x0 (`T0`): The 1st argument of func. * ​x1 (`T1`): The 2nd argument of func. * ​x2 (`T2`): The 3rd argument of func. * ​bench\_id (`BenchId`): The benchmark Id object used for identification. * ​measures (`List[ThroughputMeasure]`): Optional arg used to represent a list of `ThroughputMeasure` values.
`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())`

Benchmark function `func` with 4 input arguments and return type `T_out`.

**Parameters:**

* T0 (`AnyTrivialRegType`): Type of the 1st argument of func.
* T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func.
* T3 (`AnyTrivialRegType`): Type of the 4th argument of func.
* T\_out (`AnyTrivialRegType`): Output type of func.

**Args:**

* func (`fn(T0, T1, T2, T3) -> T_out`): The function to be benchmarked (run in benchmark iterations).
* x0 (`T0`): The 1st argument of func.
* x1 (`T1`): The 2nd argument of func.
* x2 (`T2`): The 3rd argument of func.
* x3 (`T3`): The 4th argument of func.
* bench\_id (`BenchId`): The benchmark Id object used for identification.
* measures (`List[ThroughputMeasure]`): Optional argument representing a list of `ThroughputMeasure` values.

`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())`

Benchmark function `func` with 5 input arguments and return type `T_out`.

**Parameters:**

* T0 (`AnyTrivialRegType`): Type of the 1st argument of func.
* T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func.
* T3 (`AnyTrivialRegType`): Type of the 4th argument of func.
* T4 (`AnyTrivialRegType`): Type of the 5th argument of func.
* T\_out (`AnyTrivialRegType`): Output type of func.

**Args:**

* func (`fn(T0, T1, T2, T3, T4) -> T_out`): The function to be benchmarked (run in benchmark iterations).
* x0 (`T0`): The 1st argument of func.
* x1 (`T1`): The 2nd argument of func.
* x2 (`T2`): The 3rd argument of func.
* x3 (`T3`): The 4th argument of func.
* x4 (`T4`): The 5th argument of func.
* bench\_id (`BenchId`): The benchmark Id object used for identification.
* measures (`List[ThroughputMeasure]`): Optional argument representing a list of `ThroughputMeasure` values.

`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())`

Benchmark function `func` with 6 input arguments and return type `T_out`.

**Parameters:**

* T0 (`AnyTrivialRegType`): Type of the 1st argument of func.
* T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func.
* T3 (`AnyTrivialRegType`): Type of the 4th argument of func.
* T4 (`AnyTrivialRegType`): Type of the 5th argument of func.
* T5 (`AnyTrivialRegType`): Type of the 6th argument of func.
* T\_out (`AnyTrivialRegType`): Output type of func.

**Args:**

* func (`fn(T0, T1, T2, T3, T4, T5) -> T_out`): The function to be benchmarked (run in benchmark iterations).
* x0 (`T0`): The 1st argument of func.
* x1 (`T1`): The 2nd argument of func.
* x2 (`T2`): The 3rd argument of func.
* x3 (`T3`): The 4th argument of func.
* x4 (`T4`): The 5th argument of func.
* x5 (`T5`): The 6th argument of func.
* bench\_id (`BenchId`): The benchmark Id object used for identification.
* measures (`List[ThroughputMeasure]`): Optional argument representing a list of `ThroughputMeasure` values.

`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())`

Benchmark function `func` with 7 input arguments and return type `T_out`.

**Parameters:**

* T0 (`AnyTrivialRegType`): Type of the 1st argument of func.
* T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func.
* T3 (`AnyTrivialRegType`): Type of the 4th argument of func.
* T4 (`AnyTrivialRegType`): Type of the 5th argument of func.
* T5 (`AnyTrivialRegType`): Type of the 6th argument of func.
* T6 (`AnyTrivialRegType`): Type of the 7th argument of func.
* T\_out (`AnyTrivialRegType`): Output type of func.

**Args:**

* func (`fn(T0, T1, T2, T3, T4, T5, T6) -> T_out`): The function to be benchmarked (run in benchmark iterations).
* x0 (`T0`): The 1st argument of func.
* x1 (`T1`): The 2nd argument of func.
* x2 (`T2`): The 3rd argument of func.
* x3 (`T3`): The 4th argument of func.
* x4 (`T4`): The 5th argument of func.
* x5 (`T5`): The 6th argument of func.
* x6 (`T6`): The 7th argument of func.
* bench\_id (`BenchId`): The benchmark Id object used for identification.
* measures (`List[ThroughputMeasure]`): Optional argument representing a list of `ThroughputMeasure` values.

`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())`

Benchmark function `func` with 8 input arguments and return type `T_out`.

**Parameters:**

* T0 (`AnyTrivialRegType`): Type of the 1st argument of func.
* T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func.
* T3 (`AnyTrivialRegType`): Type of the 4th argument of func.
* T4 (`AnyTrivialRegType`): Type of the 5th argument of func.
* T5 (`AnyTrivialRegType`): Type of the 6th argument of func.
* T6 (`AnyTrivialRegType`): Type of the 7th argument of func.
* T7 (`AnyTrivialRegType`): Type of the 8th argument of func.
* T\_out (`AnyTrivialRegType`): Output type of func.

**Args:**

* func (`fn(T0, T1, T2, T3, T4, T5, T6, T7) -> T_out`): The function to be benchmarked (run in benchmark iterations).
* x0 (`T0`): The 1st argument of func.
* x1 (`T1`): The 2nd argument of func.
* x2 (`T2`): The 3rd argument of func.
* x3 (`T3`): The 4th argument of func.
* x4 (`T4`): The 5th argument of func.
* x5 (`T5`): The 6th argument of func.
* x6 (`T6`): The 7th argument of func.
* x7 (`T7`): The 8th argument of func.
* bench\_id (`BenchId`): The benchmark Id object used for identification.
* measures (`List[ThroughputMeasure]`): Optional argument representing a list of `ThroughputMeasure` values.
`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, T8: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7, T8) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, x8: T8, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())`

Benchmark function `func` with 9 input arguments and return type `T_out`.

**Parameters:**

* T0 (`AnyTrivialRegType`): Type of the 1st argument of func.
* T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func.
* T3 (`AnyTrivialRegType`): Type of the 4th argument of func.
* T4 (`AnyTrivialRegType`): Type of the 5th argument of func.
* T5 (`AnyTrivialRegType`): Type of the 6th argument of func.
* T6 (`AnyTrivialRegType`): Type of the 7th argument of func.
* T7 (`AnyTrivialRegType`): Type of the 8th argument of func.
* T8 (`AnyTrivialRegType`): Type of the 9th argument of func.
* T\_out (`AnyTrivialRegType`): Output type of func.

**Args:**

* func (`fn(T0, T1, T2, T3, T4, T5, T6, T7, T8) -> T_out`): The function to be benchmarked (run in benchmark iterations).
* x0 (`T0`): The 1st argument of func.
* x1 (`T1`): The 2nd argument of func.
* x2 (`T2`): The 3rd argument of func.
* x3 (`T3`): The 4th argument of func.
* x4 (`T4`): The 5th argument of func.
* x5 (`T5`): The 6th argument of func.
* x6 (`T6`): The 7th argument of func.
* x7 (`T7`): The 8th argument of func.
* x8 (`T8`): The 9th argument of func.
* bench\_id (`BenchId`): The benchmark Id object used for identification.
* measures (`List[ThroughputMeasure]`): Optional argument representing a list of `ThroughputMeasure` values.

`run[T0: AnyTrivialRegType, T1: AnyTrivialRegType, T2: AnyTrivialRegType, T3: AnyTrivialRegType, T4: AnyTrivialRegType, T5: AnyTrivialRegType, T6: AnyTrivialRegType, T7: AnyTrivialRegType, T8: AnyTrivialRegType, T9: AnyTrivialRegType, /, T_out: AnyTrivialRegType](mut self, func: fn(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9) -> T_out, x0: T0, x1: T1, x2: T2, x3: T3, x4: T4, x5: T5, x6: T6, x7: T7, x8: T8, x9: T9, *, bench_id: BenchId, measures: List[ThroughputMeasure] = List())`

Benchmark function `func` with 10 input arguments and return type `T_out`.

**Parameters:**

* T0 (`AnyTrivialRegType`): Type of the 1st argument of func.
* T1 (`AnyTrivialRegType`): Type of the 2nd argument of func.
* T2 (`AnyTrivialRegType`): Type of the 3rd argument of func.
* T3 (`AnyTrivialRegType`): Type of the 4th argument of func.
* T4 (`AnyTrivialRegType`): Type of the 5th argument of func.
* T5 (`AnyTrivialRegType`): Type of the 6th argument of func.
* T6 (`AnyTrivialRegType`): Type of the 7th argument of func.
* T7 (`AnyTrivialRegType`): Type of the 8th argument of func.
* T8 (`AnyTrivialRegType`): Type of the 9th argument of func.
* T9 (`AnyTrivialRegType`): Type of the 10th argument of func.
* T\_out (`AnyTrivialRegType`): Output type of func.

**Args:**

* func (`fn(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9) -> T_out`): The function to be benchmarked (run in benchmark iterations).
* x0 (`T0`): The 1st argument of func.
* x1 (`T1`): The 2nd argument of func.
* x2 (`T2`): The 3rd argument of func.
* x3 (`T3`): The 4th argument of func.
* x4 (`T4`): The 5th argument of func.
* x5 (`T5`): The 6th argument of func.
* x6 (`T6`): The 7th argument of func.
* x7 (`T7`): The 8th argument of func.
* x8 (`T8`): The 9th argument of func.
* x9 (`T9`): The 10th argument of func.
* bench\_id (`BenchId`): The benchmark Id object used for identification.
* measures (`List[ThroughputMeasure]`): Optional argument representing a list of `ThroughputMeasure` values.

---

## Quickstart

In this quickstart guide, you'll learn how to install Modular in a Python environment and run inference with a GenAI model. We'll first use our Python API to run offline inference, then start a local endpoint and use the OpenAI Python API to send inference requests.

## Set up your project

First, install the `max` CLI and Python library:

:::note
When using `pip`, we use the `--index-url` argument to ensure that `torch` installs CPU dependencies only, avoiding a lot of unnecessary GPU packages. This is a temporary workaround until we can remove all dependencies on PyTorch.
:::

## Run offline inference

You can run inference locally with the `max` Python API. Just specify the Hugging Face model you want and then generate results with one or more prompts. In this example, we use a Llama 3.1 model that's not gated on Hugging Face, so you don't need an access token:

```python title="offline-inference.py"
from max.entrypoints.llm import LLM
from max.pipelines import PipelineConfig


def main():
    model_path = "modularai/Llama-3.1-8B-Instruct-GGUF"
    pipeline_config = PipelineConfig(model_path=model_path)
    llm = LLM(pipeline_config)

    prompts = [
        "In the beginning, there was",
        "I believe the meaning of life is",
        "The fastest way to learn python is",
    ]

    print("Generating responses...")
    responses = llm.generate(prompts, max_new_tokens=50)
    for i, (prompt, response) in enumerate(zip(prompts, responses)):
        print(f"========== Response {i} ==========")
        print(prompt + response)
        print()


if __name__ == "__main__":
    main()
```

Run it and you should see a response similar to this:

```sh
python offline-inference.py
```

```output
========== Response 0 ==========
In the beginning, there was Andromeda. The Andromeda galaxy, that is. It's the closest major galaxy to our own Milky Way, and it's been a source of fascination for astronomers and space enthusiasts for centuries. But what if I told you that there's

========== Response 1 ==========
I believe the meaning of life is to find your gift. The purpose of life is to give it away to others. I believe that the meaning of life is to find your gift. The purpose of life is to give it away to others. I believe that the meaning of life is

========== Response 2 ==========
The fastest way to learn python is to practice with real-world projects. Here are some ideas for projects that you can use to learn Python:

1. **Command Line Calculator**: Create a command line calculator that can perform basic arithmetic operations like addition, subtraction, multiplication, and division.
```

More information about this API is available in the [offline inference guide](/max/serve/offline-inference).

## Run inference with an endpoint

Now let's start a local server that runs the model using an OpenAI-compatible endpoint:

1.
Install the `openai` client library:

```bash
pip install openai
```

Or, if you manage your environment with `uv` or `magic`:

```bash
uv add openai
```

```bash
magic add openai
```

2. Start the endpoint with the [`max`](/max/max-cli) CLI:

```bash
max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
```

3. Create a new file that sends an inference request:

```python title="generate-text.py"
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {
            "role": "user",
            "content": "Who won the world series in 2020?"
        },
    ],
)

print(completion.choices[0].message.content)
```

Notice that the `OpenAI` API requires the `api_key` argument, but our endpoint doesn't use it.

4. Run it and you should see results like this:

```sh
python generate-text.py
```

```output
The Los Angeles Dodgers won the 2020 World Series. They defeated the Tampa Bay Rays in the series 4 games to 2. This was the Dodgers' first World Series title since 1988.
```

That's it. You just served Llama 3.1 on your local CPU and ran inference using our OpenAI-compatible [Serve API](/max/api/serve). You can also [deploy the same endpoint to a cloud GPU](/max/tutorials/max-serve-local-to-cloud) using our [Docker container](/max/container).

To run a different model, change the `--model-path` to something else from [our model repository](https://builds.modular.com/?category=models).

## Keep going

There's still a lot more to learn. Here are some directions you can go:

### Docs

* [Serving](/max/serve/): Try more serving features like function calling, tool use, structured output, and more.
* [Deploying](/max/deploy/): Try a tutorial to deploy a model on a cloud GPU using our Docker container.
* [Developing](/max/develop/): Discover all the ways you can customize your AI deployments, such as writing custom ops and GPU kernels in Mojo.
* [Mojo manual](/mojo/manual/): Learn to program in Mojo, a Pythonic systems programming language that allows you to write code for both CPUs and GPUs.

### Resources

* [Model repo](https://builds.modular.com/?category=models): Hundreds of GenAI models accelerated with Modular.
* [Tutorials](/max/tutorials/): Step-by-step procedures to develop and deploy with Modular.
* [Recipes](https://builds.modular.com/?category=recipes): Turn-key applications that use GenAI models with Modular.
* [GPU puzzles](https://builds.modular.com/puzzles): A hands-on guide to mastering GPU programming with Mojo.

---

## radix_sort_pairs_kernel

`radix_sort_pairs_kernel[type: DType,
out_idx_type: DType, current_bit: Int, ascending: Bool = False, BLOCK_SIZE: Int = 256, NUM_BITS_PER_PASS: Int = 4](input_keys_: UnsafePointer[SIMD[type, 1]], output_keys_: UnsafePointer[SIMD[type, 1]], input_key_ids_: UnsafePointer[SIMD[out_idx_type, 1]], output_key_ids_: UnsafePointer[SIMD[out_idx_type, 1]], num_keys: Int, skip_sort: UnsafePointer[SIMD[bool, 1]])`

Radix pair sort kernel for (default) descending order. Implementation based on: AMD, "Introduction to GPU Radix Sort," GPUOpen, 2017.

**Parameters:**

* type (`DType`): Data type.
* out\_idx\_type (`DType`): Output index type.
* current\_bit (`Int`): Current bit to start sorting NUM\_BITS\_PER\_PASS bits at.
* ascending (`Bool`): Whether to sort in ascending order.
* BLOCK\_SIZE (`Int`): Block size.
* NUM\_BITS\_PER\_PASS (`Int`): Number of bits per pass.

**Args:**

* input\_keys\_ (`UnsafePointer[SIMD[type, 1]]`): Input tensor values to sort.
* output\_keys\_ (`UnsafePointer[SIMD[type, 1]]`): Output tensor values sorted in (default) descending order.
* input\_key\_ids\_ (`UnsafePointer[SIMD[out_idx_type, 1]]`): Input tensor indices.
* output\_key\_ids\_ (`UnsafePointer[SIMD[out_idx_type, 1]]`): Output tensor indices sorted in (default) descending order.
* num\_keys (`Int`): Number of keys to sort per batch.
* skip\_sort (`UnsafePointer[SIMD[bool, 1]]`): Whether sorting is skipped for this batch.

---

## Ragged tensors

Ragged tensors are a way to batch multiple requests with differing sequence lengths without the need for [padding tokens](padding-tokens.mdx). Ragged tensors allow sequences of variable lengths to be processed together efficiently by storing them in a compact, non-uniform format. Also sometimes referred to as "packed tensors."

---

## ragged_attention

An opaque, KV-cache-optimized vanilla attention mechanism, with mask variants provided inside the kernel.

## `RaggedAttention` {#max.nn.attention.ragged_attention.RaggedAttention}

> *class* max.nn.attention.ragged\_attention.RaggedAttention(\*, mask\_variant, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, linear\_cls=\<class 'max.nn.linear.Linear'>, stacked\_qkv=False, scale=None, has\_bias=False, clip\_qkv=None)

Layer that computes the self attention score for ragged inputs.

Initializes the attention layer.

**Parameters:**

* **rope** – The rope layer to borrow the freq\_cis value from.
* **num\_attention\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The number of attention heads.
* **num\_key\_value\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Number of key/value heads.
* **hidden\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The dimension of the hidden states.
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) ) – KV Cache Params, including the number of kv heads, the head dim, and data type.
* **dtype** ([`DType`](../../dtype.md#max.dtype.DType) ) – DType of the layer weights.
* **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` `|` `None` ) – Device to place the weights and run the computation. If multiple are provided, the first device is used.
* **linear\_cls** (`Callable` `[` `...` `,` [`Linear`](../linear.md#max.nn.linear.Linear) `]` ) – Linear class to use for the outputs dense layer.
* **stacked\_qkv** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether the weights are stacked together. * **scale** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – Value used to scale the results of the attention output. * **has\_bias** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – Whether to use an attention bias. * **clip\_qkv** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) – If provided, the QKV weights are clamped between \[-clip\_qkv, clip\_qkv] * **mask\_variant** ([`MHAMaskVariant`](../kernels.md#max.nn.kernels.MHAMaskVariant) ) ### `wqkv` {#max.nn.attention.ragged_attention.RaggedAttention.wqkv} > *property* wqkv\*: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)\* The concatenation of q, k, and v weight vectors. --- ## RaggedMHAOperand `@register_passable(trivial)` `struct RaggedMHAOperand[type_: DType, shape: DimList, stride: DimList]` An implementation for ragged NDBuffer arguments to MHA kernels. ## Fields * ​buffer (`NDBuffer[type_, 3, MutableAnyOrigin, shape, stride]`): * ​cache\_row\_offsets (`NDBuffer[uint32, 1, MutableAnyOrigin]`): ## Implemented traits `AnyType`, `Copyable`, `MHAOperand`, `Movable`, `UnknownDestructibility` ## Aliases ### `type` `alias type = type_` ## Methods ### `__init__` `__init__(buffer: NDBuffer[type_, 3, MutableAnyOrigin, shape, stride], cache_row_offsets: NDBuffer[uint32, 1, MutableAnyOrigin, shape, strides]) -> Self` ### `block_paged_ptr` `block_paged_ptr[tile_size: Int](self, batch_idx: SIMD[uint32, 1], start_tok_idx: SIMD[uint32, 1], head_idx: SIMD[uint32, 1], head_dim_idx: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0)) -> UnsafePointer[SIMD[type_, 1]]` ### `cache_length` `cache_length(self, batch_idx: Int) -> Int` ### `max_context_length` `max_context_length(self) -> SIMD[uint32, 1]` --- ## RaisingCoroutine `@register_passable` `struct RaisingCoroutine[type: AnyType, origins: origin.set]` Represents a coroutine that can raise. Coroutines can pause execution saving the state of the program (including values of local variables and the location of the next instruction to be executed). When the coroutine is resumed, execution continues from where it left off, with the saved state restored. ## Parameters * ​type (`AnyType`): Type of value returned upon completion of the coroutine. * ​origins (`origin.set`): The origin set of the coroutine's captures. ## Implemented traits `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(handle: !co.routine) -> Self` Construct a coroutine object from a handle. **Args:** * ​handle (`!co.routine`): The init handle. ### `__await__` `__await__(owned self, out result: type)` Suspends the current coroutine until the coroutine is complete. **Returns:** The coroutine promise. ### `force_destroy` `force_destroy(owned self)` Destroy the coroutine object. --- ## rand `rand[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], size: Int, /, *, min: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), max: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1), int_scale: Optional[Int] = Optional(None))` Fills memory with random values from a uniform distribution. **Parameters:** * ​dtype (`DType`): The dtype of the pointer. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The pointer to the memory area to fill. 
* size (`Int`): The number of elements to fill.
* min (`SIMD[float64, 1]`): The minimum value for the random range.
* max (`SIMD[float64, 1]`): The maximum value for the random range.
* int\_scale (`Optional[Int]`): The scale for error checking (float type only).

---

## rand_uniform

## Functions

* [​`random_uniform`](./random_uniform): Call `output_fn` with values generated from a uniform distribution on \[lower\_bound, upper\_bound] for floating-point types or \[lower\_bound, upper\_bound) for integer types.

---

## randint

`randint[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1]], size: Int, low: Int, high: Int)`

Fills memory with uniform random values in the range \[low, high].

**Constraints:**

The type should be integral.

**Parameters:**

* dtype (`DType`): The dtype of the pointer.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1]]`): The pointer to the memory area to fill.
* size (`Int`): The number of elements to fill.
* low (`Int`): The minimum value for the random range.
* high (`Int`): The maximum value for the random range.

---

## randn

## Functions

* [​`random_normal`](./random_normal): Fill `output` with values generated from Normal(mean, variance) distribution.

---

## randn

`randn[dtype: DType](ptr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin], size: Int, mean: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), standard_deviation: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1))`

Fills memory with random values from a Normal(mean, standard\_deviation) distribution.

**Constraints:**

The type should be floating point.

**Parameters:**

* dtype (`DType`): The dtype of the pointer.

**Args:**

* ptr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, origin=origin]`): The pointer to the memory area to fill.
* size (`Int`): The number of elements to fill.
* mean (`SIMD[float64, 1]`): Normal distribution mean.
* standard\_deviation (`SIMD[float64, 1]`): Normal distribution standard deviation.

---

## randn_float64

`randn_float64(mean: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](0), standard_deviation: SIMD[float64, 1] = __init__[__mlir_type.!pop.float_literal](1)) -> SIMD[float64, 1]`

Returns a random double sampled from a Normal(mean, standard\_deviation) distribution.

**Args:**

* mean (`SIMD[float64, 1]`): Normal distribution mean.
* standard\_deviation (`SIMD[float64, 1]`): Normal distribution standard deviation.

**Returns:**

A random float64 sampled from Normal(mean, standard\_deviation).

---

## random

Random number generation for GPU kernels.

This module implements a high-performance random number generator using the Philox algorithm, which is designed for parallel and GPU computing. The Philox algorithm is a counter-based random number generator that provides high-quality random numbers with excellent statistical properties.

The main class is Random which generates both uniform random numbers and raw 32-bit integers. It supports:

* Seeding for reproducible sequences
* Multiple independent subsequences
* Configurable number of rounds for quality vs performance tradeoff
* Vectorized operations for efficiency

Example:

```mojo
from gpu.random import Random

rng = Random(seed=42)
uniform_values = rng.step_uniform()  # Returns 4 random floats in [0,1)
raw_values = rng.step()  # Returns 4 raw 32-bit integers
```

## Structs

* [​`Random`](/mojo/stdlib/gpu/random/Random): A high-performance random number generator using the Philox algorithm.

---

## random

Implements the random package.
## Modules

* [​`random`](/mojo/stdlib/random/random/): Provides functions for random numbers.

---

## random

Provides functions for random numbers. You can import these APIs from the `random` package. For example:

```mojo
from random import seed
```

## Functions

* [​`rand`](/mojo/stdlib/random/random/rand): Fills memory with random values from a uniform distribution.
* [​`randint`](/mojo/stdlib/random/random/randint): Fills memory with uniform random values in the range \[low, high].
* [​`randn`](/mojo/stdlib/random/random/randn): Fills memory with random values from a Normal(mean, standard\_deviation) distribution.
* [​`randn_float64`](/mojo/stdlib/random/random/randn_float64): Returns a random double sampled from a Normal(mean, standard\_deviation) distribution.
* [​`random_float64`](/mojo/stdlib/random/random/random_float64): Returns a random `Float64` number from the given range.
* [​`random_si64`](/mojo/stdlib/random/random/random_si64): Returns a random `Int64` number from the given range.
* [​`random_ui64`](/mojo/stdlib/random/random/random_ui64): Returns a random `UInt64` number from the given range.
* [​`seed`](/mojo/stdlib/random/random/seed): Seeds the random number generator using the current time.
* [​`shuffle`](/mojo/stdlib/random/random/shuffle): Shuffles the elements of the list randomly.

---

## Random

`struct Random[rounds: Int = 6]`

A high-performance random number generator using the Philox algorithm.

The Philox algorithm is a counter-based random number generator designed for parallel and GPU computing. It provides high-quality random numbers with excellent statistical properties.

## Parameters

* rounds (`Int`): Number of mixing rounds to perform. Higher values provide better statistical quality at the cost of performance. Default is 6.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, *, seed: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0), subsequence: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0), offset: SIMD[uint64, 1] = __init__[__mlir_type.!pop.int_literal](0))`

Initialize the random number generator.

**Args:**

* seed (`SIMD[uint64, 1]`): Initial seed value for reproducible sequences. Default is 0.
* subsequence (`SIMD[uint64, 1]`): Subsequence number for generating independent streams. Default is 0.
* offset (`SIMD[uint64, 1]`): Starting offset in the sequence. Default is 0.

### `step`

`step(mut self) -> SIMD[uint32, 4]`

Generate 4 random 32-bit unsigned integers.

**Returns:**

SIMD vector containing 4 random 32-bit unsigned integers.

### `step_uniform`

`step_uniform(mut self) -> SIMD[float32, 4]`

Generate 4 random floating point numbers uniformly distributed in \[0,1).

**Returns:**

SIMD vector containing 4 random float32 values in range \[0,1).

---

## random_float64

`random_float64(min: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](0), max: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](1)) -> SIMD[float64, 1]`

Returns a random `Float64` number from the given range.

**Args:**

* min (`SIMD[float64, 1]`): The minimum number in the range (default is 0.0).
* max (`SIMD[float64, 1]`): The maximum number in the range (default is 1.0).

**Returns:**

A random number from the specified range.
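As a quick orientation to the `random` package functions documented above, here is a minimal sketch that sticks to the signatures as listed; treat it as illustrative rather than canonical:

```mojo
from random import random_float64, random_si64, seed


def main():
    seed()  # seed the generator from the current time
    var f = random_float64(0, 10)  # uniform Float64 in [0, 10]
    var i = random_si64(-5, 5)     # random Int64 from the range [-5, 5]
    print(f, i)
```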
--- ## random_normal `random_normal[type: DType, mean: SIMD[float64, 1], variance: SIMD[float64, 1]](output: LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Fill `output` with values generated from Normal(mean, variance) distribution. **Args:** * ​output (`LayoutTensor[type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output buffer. --- ## random_si64 `random_si64(min: SIMD[int64, 1], max: SIMD[int64, 1]) -> SIMD[int64, 1]` Returns a random `Int64` number from the given range. **Args:** * ​min (`SIMD[int64, 1]`): The minimum number in the range. * ​max (`SIMD[int64, 1]`): The maximum number in the range. **Returns:** A random number from the specified range. --- ## random_ui64 `random_ui64(min: SIMD[uint64, 1], max: SIMD[uint64, 1]) -> SIMD[uint64, 1]` Returns a random `UInt64` number from the given range. **Args:** * ​min (`SIMD[uint64, 1]`): The minimum number in the range. * ​max (`SIMD[uint64, 1]`): The maximum number in the range. **Returns:** A random number from the specified range. --- ## random_uniform `random_uniform[: origin.set, dtype: DType, rank: Int, //, output_fn: fn[Int, Int](idx: IndexList[$1], val: SIMD[dtype, $0]) capturing -> None, target: StringSlice[StaticConstantOrigin]](shape: IndexList[rank], lower_bound: SIMD[dtype, 1], upper_bound: SIMD[dtype, 1], seed_value: SIMD[uint64, 1], ctx: DeviceContextPtr)` Call `output_fn` with values generated from a uniform distribution on \[lower\_bound, upper\_bound] for floating-point types or \[lower\_bound, upper\_bound) for integer types. **Parameters:** * ​dtype (`DType`): The data type to generate. * ​rank (`Int`): The rank of the underlying buffer. * ​output\_fn (`fn[Int, Int](idx: IndexList[$1], val: SIMD[dtype, $0]) capturing -> None`): The function which stores the generated values. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​shape (`IndexList[rank]`): The shape of the output being stored into by output\_fn. * ​lower\_bound (`SIMD[dtype, 1]`): The lower bound on the uniform range. * ​upper\_bound (`SIMD[dtype, 1]`): The upper bound on the uniform range. * ​seed\_value (`SIMD[uint64, 1]`): Seed value used to initialize the random number generator. * ​ctx (`DeviceContextPtr`): The device context. --- ## range Implements a 'range' call. These are Mojo built-ins, so you don't need to import them. ## Functions * [​`range`](/mojo/stdlib/builtin/range/range): Constructs a \[0; end) Range. --- ## range `range[T: Indexer, //](end: T) -> _ZeroStartingRange` Constructs a \[0; end) Range. **Parameters:** * ​T (`Indexer`): The type of the end value. **Args:** * ​end (`T`): The end of the range. **Returns:** The constructed range. `range[T: IntableRaising, //](end: T) -> _ZeroStartingRange` Constructs a \[0; end) Range. **Parameters:** * ​T (`IntableRaising`): The type of the end value. **Args:** * ​end (`T`): The end of the range. **Returns:** The constructed range. **Raises:** An error if the conversion to an `Int` failed. `range(end: PythonObject) -> _ZeroStartingRange` Constructs a \[0; end) Range from a Python `int`. **Args:** * ​end (`PythonObject`): The end of the range as a Python `int`. **Returns:** The constructed range. **Raises:** An error if converting `end` to an `Int` failed. 
`range[T0: Indexer, T1: Indexer, //](start: T0, end: T1) -> _SequentialRange`

Constructs a \[start; end) Range.

**Parameters:**

* T0 (`Indexer`): The type of the start value.
* T1 (`Indexer`): The type of the end value.

**Args:**

* start (`T0`): The start of the range.
* end (`T1`): The end of the range.

**Returns:**

The constructed range.

`range[T0: IntableRaising, T1: IntableRaising](start: T0, end: T1) -> _SequentialRange`

Constructs a \[start; end) Range.

**Parameters:**

* T0 (`IntableRaising`): The type of the start value.
* T1 (`IntableRaising`): The type of the end value.

**Args:**

* start (`T0`): The start of the range.
* end (`T1`): The end of the range.

**Returns:**

The constructed range.

**Raises:**

An error if converting `start` or `end` to an `Int` failed.

`range(start: PythonObject, end: PythonObject) -> _SequentialRange`

Constructs a \[start; end) Range from Python `int` objects.

**Args:**

* start (`PythonObject`): The start of the range as a Python `int`.
* end (`PythonObject`): The end of the range as a Python `int`.

**Returns:**

The constructed range.

**Raises:**

An error if converting `start` or `end` to an `Int` failed.

`range[T0: Indexer, T1: Indexer, T2: Indexer, //](start: T0, end: T1, step: T2) -> _StridedRange`

Constructs a \[start; end) Range with a given step.

**Parameters:**

* T0 (`Indexer`): The type of the start value.
* T1 (`Indexer`): The type of the end value.
* T2 (`Indexer`): The type of the step value.

**Args:**

* start (`T0`): The start of the range.
* end (`T1`): The end of the range.
* step (`T2`): The step for the range.

**Returns:**

The constructed range.

`range[T0: IntableRaising, T1: IntableRaising, T2: IntableRaising, //](start: T0, end: T1, step: T2) -> _StridedRange`

Constructs a \[start; end) Range with a given step.

**Parameters:**

* T0 (`IntableRaising`): The type of the start value.
* T1 (`IntableRaising`): The type of the end value.
* T2 (`IntableRaising`): The type of the step value.

**Args:**

* start (`T0`): The start of the range.
* end (`T1`): The end of the range.
* step (`T2`): The step for the range.

**Returns:**

The constructed range.

**Raises:**

An error if converting `start`, `end`, or `step` to an `Int` failed.

`range(start: PythonObject, end: PythonObject, step: PythonObject) -> _StridedRange`

Constructs a \[start; end) Range from Python `int` objects with a given step.

**Args:**

* start (`PythonObject`): The start of the range as a Python `int`.
* end (`PythonObject`): The end of the range as a Python `int`.
* step (`PythonObject`): The step for the range as a Python `int`.

**Returns:**

The constructed range.

**Raises:**

An error if converting `start`, `end`, or `step` to an `Int` failed.

`range(end: UInt) -> _UIntZeroStartingRange`

Constructs a \[0; end) Range.

**Args:**

* end (`UInt`): The end of the range.

**Returns:**

The constructed range.

`range(start: UInt, end: UInt, step: UInt = UInt(1)) -> _UIntStridedRange`

Constructs a \[start; end) Range with a given step.

**Args:**

* start (`UInt`): The start of the range.
* end (`UInt`): The end of the range.
* step (`UInt`): The step for the range. Defaults to 1.

**Returns:**

The constructed range.

`range[dtype: DType, //](end: SIMD[dtype, 1]) -> _ZeroStartingScalarRange[dtype]`

Constructs a \[0; end) Range.

**Parameters:**

* dtype (`DType`): The range dtype.

**Args:**

* end (`SIMD[dtype, 1]`): The end of the range.

**Returns:**

The constructed range.
`range[dtype: DType, //](start: SIMD[dtype, 1], end: SIMD[dtype, 1]) -> _SequentialScalarRange[dtype]`

Constructs a \[start; end) Range.

**Parameters:**

* dtype (`DType`): The range dtype.

**Args:**

* start (`SIMD[dtype, 1]`): The start of the range.
* end (`SIMD[dtype, 1]`): The end of the range.

**Returns:**

The constructed range.

`range[dtype: DType, //](start: SIMD[dtype, 1], end: SIMD[dtype, 1], step: SIMD[dtype, 1]) -> _StridedScalarRange[dtype]`

Constructs a \[start; end) Range with a given step.

**Parameters:**

* dtype (`DType`): The range dtype.

**Args:**

* start (`SIMD[dtype, 1]`): The start of the range.
* end (`SIMD[dtype, 1]`): The end of the range.
* step (`SIMD[dtype, 1]`): The step for the range.

**Returns:**

The constructed range.

---

## read_x

`read_x[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)`

---

## read_y

`read_y[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)`

---

## readfirstlane

`readfirstlane(value: SIMD[int32, 1]) -> SIMD[int32, 1]`

Get the value in the lowest active lane of the input operand.

**Args:**

* value (`SIMD[int32, 1]`): The input value.

**Returns:**

The value in the lowest active lane of the input operand.

`readfirstlane(value: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`

Get the value in the lowest active lane of the input operand.

**Args:**

* value (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The input pointer.

**Returns:**

The value in the lowest active lane of the input operand.

`readfirstlane(value: Int) -> Int`

Get the value in the lowest active lane of the input operand.

**Args:**

* value (`Int`): The input value.

**Returns:**

The value in the lowest active lane of the input operand.

---

## rebind

Implements type rebind.

These are Mojo built-ins, so you don't need to import them.

## Functions

* [​`rebind`](/mojo/stdlib/builtin/rebind/rebind): Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type.

---

## rebind

`rebind[src_type: AnyTrivialRegType, //, dest_type: AnyTrivialRegType](src: src_type) -> dest_type`

Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type.

This function is meant to be used in uncommon cases where a parametric type depends on the value of a constrained parameter in order to manually refine the type with the constrained parameter value.

**Parameters:**

* src\_type (`AnyTrivialRegType`): The original type.
* dest\_type (`AnyTrivialRegType`): The type to rebind to.

**Args:**

* src (`src_type`): The value to rebind.

**Returns:**

The rebound value of `dest_type`.

`rebind[src_type: AnyType, //, dest_type: AnyType](ref src: src_type) -> ref [src] dest_type`

Statically assert that a parameter input type `src_type` resolves to the same type as a parameter result type `dest_type` after function instantiation and "rebind" the input to the result type, returning a reference to the input value with an adjusted type.
This function is meant to be used in uncommon cases where a parametric type depends on the value of a constrained parameter in order to manually refine the type with the constrained parameter value. **Parameters:** * ​src\_type (`AnyType`): The original type. * ​dest\_type (`AnyType`): The type to rebind to. **Args:** * ​src (`src_type`): The value to rebind. **Returns:** A reference to the value rebound as `dest_type`. --- ## rebuild_mix_precision_static_tensor_specs_with_input_lambda `rebuild_mix_precision_static_tensor_specs_with_input_lambda[func_type: AnyTrivialRegType, //, src_type: DType, dst_type: DType, rank: Int](spec: StaticTensorSpec[src_type, rank], in_lambda: func_type) -> StaticTensorSpec[dst_type, rank]` --- ## rebuild_mix_precision_static_tensor_specs_with_output_lambda `rebuild_mix_precision_static_tensor_specs_with_output_lambda[func_type: AnyTrivialRegType, //, dst_type: DType, src_type: DType, rank: Int](spec: StaticTensorSpec[dst_type, rank], out_lambda: func_type) -> StaticTensorSpec[src_type, rank]` --- ## rebuild_static_tensor_specs_with_input_lambda `rebuild_static_tensor_specs_with_input_lambda[func_type: AnyTrivialRegType, //, type: DType, rank: Int](spec: StaticTensorSpec[type, rank], in_lambda: func_type) -> StaticTensorSpec[type, rank]` --- ## rebuild_static_tensor_specs_with_output_lambda `rebuild_static_tensor_specs_with_output_lambda[func_type: AnyTrivialRegType, //, type: DType, rank: Int](spec: StaticTensorSpec[type, rank], out_lambda: func_type) -> StaticTensorSpec[type, rank]` --- ## recip `recip[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise reciprocal on a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform reciprocal on. **Returns:** The elementwise reciprocal of x. --- ## reciprocal `reciprocal(x: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## reduce `reduce[: origin.set, //, reducer: fn[ImmutableOrigin](a: Int, b: IntTuple[$0]) capturing -> Int](t: IntTuple[origin], initializer: Int) -> Int` Apply a reduction function to an `IntTuple` with an initial value. This function iterates through each element of the `IntTuple` and applies the provided reduction function cumulatively, starting with the initializer. **Parameters:** * ​reducer (`fn[ImmutableOrigin](a: Int, b: IntTuple[$0]) capturing -> Int`): A function that combines the accumulated result with the next element. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` to reduce. * ​initializer (`Int`): The initial value for the reduction operation. **Returns:** The final accumulated result after applying the reduction function to all elements in the `IntTuple`. --- ## reduce `reduce[: origin.set, //, reduce_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]](src: NDBuffer[type, 1, origin], init: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Computes a custom reduction of buffer elements. **Parameters:** * ​reduce\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): The lambda implementing the reduction. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The input buffer. * ​init (`SIMD[dtype, 1]`): The initial value to use in accumulator. **Returns:** The computed reduction value. 
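To make the shape of the `reduce_fn` parameter above concrete, here is a small sketch of a custom max-reduction over a rank-1 buffer. The `stack_allocation` and `NDBuffer` construction details are assumptions based on the stdlib `memory` and `buffer` modules, not part of this entry:

```mojo
from algorithm.reduction import reduce
from buffer import NDBuffer
from memory import stack_allocation


def main():
    alias size = 8
    var data = stack_allocation[size, DType.float32]()
    for i in range(size):
        data[i] = Float32(i + 1)  # 1.0, 2.0, ..., 8.0
    var buf = NDBuffer[DType.float32, 1](data, size)

    @parameter
    fn max_fn[
        acc_dtype: DType, in_dtype: DType, width: Int
    ](acc: SIMD[acc_dtype, width], val: SIMD[in_dtype, width]) -> SIMD[
        acc_dtype, width
    ]:
        # Accumulate by taking the elementwise maximum.
        return max(acc, val.cast[acc_dtype]())

    print(reduce[max_fn](buf, Float32(0)))  # prints 8.0
```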
`reduce[: origin.set, //, map_fn: fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2], reduce_fn: fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1], reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], init: SIMD[type, 1])`

Performs a reduction across reduce\_axis of an NDBuffer (src) and stores the result in an NDBuffer (dst).

First src is reshaped into a 3D tensor. Without loss of generality, the three axes will be referred to as \[H,W,C], where the axis to reduce across is W, the axes before the reduce axis are packed into H, and the axes after the reduce axis are packed into C. i.e. a tensor with dims \[D1, D2, ..., Di, ..., Dn] reducing across axis i gets packed into a 3D tensor with dims \[H, W, C], where H=prod(D1,...,Di-1), W = Di, and C = prod(Di+1,...,Dn).

**Parameters:**

* map\_fn (`fn[DType, DType, Int](SIMD[$0, $2], SIMD[$1, $2]) capturing -> SIMD[$0, $2]`): A mapping function. This function is used to combine (accumulate) two chunks of input data: e.g., we load two 8xfloat32 vectors of elements and need to combine them into a single 8xfloat32 vector.
* reduce\_fn (`fn[DType, Int](SIMD[$0, $1]) -> SIMD[$0, 1]`): A reduction function. This function is used to reduce a vector to a scalar. E.g., when we have an 8xfloat32 vector and want to reduce it to 1xfloat32.
* reduce\_axis (`Int`): The axis to reduce across.

**Args:**

* src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer.
* dst (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The output buffer.
* init (`SIMD[type, 1]`): The initial value to use in accumulator.

---

## reduce

`reduce[val_type: DType, simd_width: Int, //, shuffle: fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1], func: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]`

Performs a generic warp-wide reduction operation using shuffle operations.

This is a convenience wrapper around lane\_group\_reduce that operates on the entire warp. It allows customizing both the shuffle operation and reduction function.

Example:

```mojo
from gpu.warp import reduce, shuffle_down

# Compute warp-wide sum using shuffle down
@parameter
fn add[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) capturing -> SIMD[type, width]:
    return x + y

val = SIMD[DType.float32, 4](2.0, 4.0, 6.0, 8.0)
result = reduce[shuffle_down, add](val)
```

**Parameters:**

* val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32).
* simd\_width (`Int`): The number of elements in the SIMD vector.
* shuffle (`fn[DType, Int](val: SIMD[$0, $1], offset: SIMD[uint32, 1]) -> SIMD[$0, $1]`): A function that performs the warp shuffle operation. Takes a SIMD value and offset and returns the shuffled result.
* func (`fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]`): A binary function that combines two SIMD values during reduction. This defines the reduction operation (e.g. add, max, min).

**Args:**

* val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value.
**Returns:**

A SIMD value containing the reduction result broadcast to all lanes in the warp.

---

## reduce_add_simd

`reduce_add_simd[simd_width: Int, step_simd_width: Int, type: DType](mut scalar: SIMD[type, 1], mut vector: SIMD[type, simd_width], val: SIMD[type, step_simd_width])`

This function adds `val` to either the scalar value or the vector value, depending on the step\_simd\_width. This is useful when the simd\_width varies between iterations, as in `vectorize`.

---

## reduce_boolean

`reduce_boolean[: origin.set, : origin.set, //, reduce_fn: fn[DType, Int](SIMD[$0, $1]) capturing -> Bool, continue_fn: fn(Bool) capturing -> Bool](src: NDBuffer[type, 1, origin], init: Bool) -> Bool`

Computes a bool reduction of buffer elements. The reduction will early exit if the `continue_fn` returns False.

**Parameters:**

* reduce\_fn (`fn[DType, Int](SIMD[$0, $1]) capturing -> Bool`): A boolean reduction function. This function is used to reduce a vector to a scalar. E.g., when we have an `8xfloat32` vector and want to reduce it to a `bool`.
* continue\_fn (`fn(Bool) capturing -> Bool`): A function to indicate whether we want to continue processing the rest of the iterations. This takes the result of the reduce\_fn and returns True to continue processing and False to early exit.

**Args:**

* src (`NDBuffer[type, 1, origin]`): The input buffer.
* init (`Bool`): The initial value to use.

**Returns:**

The computed reduction value.

---

## ReduceOp

`@register_passable(trivial)`

`struct ReduceOp`

Represents reduction operations for parallel reduction algorithms.

This struct defines different reduction operations that can be performed across multiple threads in parallel. These operations are commonly used in parallel reduction algorithms on GPUs.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `ADD`

`alias ADD = ReduceOp(0)`

Addition reduction operation. Combines values by adding them together.

### `AND`

`alias AND = ReduceOp(3)`

Bitwise AND reduction operation. Performs bitwise AND across all inputs.

### `MAX`

`alias MAX = ReduceOp(2)`

Maximum reduction operation. Finds the maximum value across all inputs.

### `MIN`

`alias MIN = ReduceOp(1)`

Minimum reduction operation. Finds the minimum value across all inputs.

### `OR`

`alias OR = ReduceOp(4)`

Bitwise OR reduction operation. Performs bitwise OR across all inputs.

### `XOR`

`alias XOR = ReduceOp(5)`

Bitwise XOR reduction operation. Performs bitwise XOR across all inputs.

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Tests if two ReduceOp instances are equal.

**Args:**

* other (`Self`): The ReduceOp instance to compare against.

**Returns:**

True if the reduction operations are equal, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Tests if two ReduceOp instances are not equal.

**Args:**

* other (`Self`): The ReduceOp instance to compare against.

**Returns:**

True if the reduction operations are different, False otherwise.

### `__is__`

`__is__(self, other: Self) -> Bool`

Tests if two ReduceOp instances are identical.

**Args:**

* other (`Self`): The ReduceOp instance to compare against.

**Returns:**

True if the reduction operations are identical, False otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

Tests if two ReduceOp instances are not identical.

**Args:**

* other (`Self`): The ReduceOp instance to compare against.

**Returns:**

True if the reduction operations are not identical, False otherwise.
### `__str__`

`__str__(self) -> String`

Returns a string representation of the reduction operation.

**Returns:**

A string describing the reduction operation.

### `mnemonic`

`mnemonic(self) -> StringSlice[StaticConstantOrigin]`

Returns the mnemonic string for the reduction operation.

**Returns:**

A string literal containing the reduction operation mnemonic.

---

## reduction

Implements SIMD reductions. You can import these APIs from the `algorithm` package. For example:

```mojo
from algorithm import map_reduce
```

## Functions

* [​`all_true`](/mojo/stdlib/algorithm/reduction/all_true): Returns True if all the elements in a buffer are True and False otherwise.
* [​`any_true`](/mojo/stdlib/algorithm/reduction/any_true): Returns True if any the elements in a buffer are True and False otherwise.
* [​`cumsum`](/mojo/stdlib/algorithm/reduction/cumsum): Computes the cumulative sum of all elements in a buffer. dst\[i] = src\[i] + src\[i-1] + ... + src\[0].
* [​`map_reduce`](/mojo/stdlib/algorithm/reduction/map_reduce): Stores the result of calling input\_gen\_fn in dst and simultaneously reduce the result using a custom reduction function.
* [​`max`](/mojo/stdlib/algorithm/reduction/max): Computes the max element in a buffer.
* [​`mean`](/mojo/stdlib/algorithm/reduction/mean): Computes the mean value of the elements in a buffer.
* [​`min`](/mojo/stdlib/algorithm/reduction/min): Computes the min element in a buffer.
* [​`none_true`](/mojo/stdlib/algorithm/reduction/none_true): Returns True if none of the elements in a buffer are True and False otherwise.
* [​`product`](/mojo/stdlib/algorithm/reduction/product): Computes the product of the buffer elements.
* [​`reduce`](/mojo/stdlib/algorithm/reduction/reduce): Computes a custom reduction of buffer elements.
* [​`reduce_boolean`](/mojo/stdlib/algorithm/reduction/reduce_boolean): Computes a bool reduction of buffer elements. The reduction will early exit if the `continue_fn` returns False.
* [​`sum`](/mojo/stdlib/algorithm/reduction/sum): Computes the sum of buffer elements.
* [​`variance`](/mojo/stdlib/algorithm/reduction/variance): Given a mean, computes the variance of elements in a buffer.

A short usage sketch for these functions appears below, after the `ReductionMethod` entry.

---

## ReductionMethod

`@register_passable(trivial)`

`struct ReductionMethod`

Enumerates the supported reduction methods.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `TENSOR_CORE`

`alias TENSOR_CORE = ReductionMethod(0)`

Use tensor core for reduction.

### `WARP`

`alias WARP = ReductionMethod(1)`

Use warp shuffle for reduction.

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Checks if two `ReductionMethod` values are equal.

**Args:**

* other (`Self`): The other ReductionMethod to compare.

**Returns:**

True if the `ReductionMethod` values are equal, false otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Checks if two `ReductionMethod` values are not equal.

**Args:**

* other (`Self`): The other ReductionMethod to compare.

**Returns:**

True if the `ReductionMethod` values are not equal, false otherwise.

### `__is__`

`__is__(self, other: Self) -> Bool`

Checks if two `ReductionMethod` values are identical.

**Args:**

* other (`Self`): The other ReductionMethod to compare.

**Returns:**

True if the `ReductionMethod` values are identical, false otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

Checks if two `ReductionMethod` values are not identical.

**Args:**

* other (`Self`): The other ReductionMethod to compare.

**Returns:**

True if the `ReductionMethod` values are not identical, false otherwise.
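As referenced in the `reduction` module listing above, here is a minimal sketch of the simpler entry points such as `sum` and `mean`; the buffer construction carries the same assumptions as the custom `reduce` sketch earlier:

```mojo
from algorithm.reduction import mean, sum
from buffer import NDBuffer
from memory import stack_allocation


def main():
    alias n = 4
    var data = stack_allocation[n, DType.float32]()
    for i in range(n):
        data[i] = Float32(i + 1)  # 1.0, 2.0, 3.0, 4.0
    var buf = NDBuffer[DType.float32, 1](data, n)
    print(sum(buf))   # 10.0
    print(mean(buf))  # 2.5
```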
---

## reflection

## Functions

* [​`get_linkage_name`](/mojo/stdlib/compile/reflection/get_linkage_name): Returns the symbol name of `func`.

---

## Register

A GPU register is the fastest form of storage within a [streaming multiprocessor](streaming-multiprocessor.mdx) (SM). Registers store integer and floating point values used frequently by a [thread](thread.mdx), reducing reliance on slower [memory](memory.mdx) types (shared, global, or local memory).

Registers are located within an SM in what is referred to as a *register file*. The number of registers depends on the GPU architecture, but modern GPUs support thousands of registers per SM. For each thread that it executes, the SM allocates a set of registers for the private use of that thread. The registers are associated with that thread throughout its lifetime, even if the thread is not currently executing on the SM's cores (for example, if it is blocked waiting for data from memory). A thread can't access registers assigned to a different thread, preventing data conflicts between threads.

If the execution of a [kernel](kernel.mdx) function by a thread requires more registers than available, the compiler arranges to spill some register data to the thread's local [memory](memory.mdx). Because local memory access is slower than register access, programmers should try to design their kernels to avoid or limit the amount of spill.

---

## registry

Model registry, for tracking various model variants.

## `PipelineRegistry` {#max.pipelines.lib.registry.PipelineRegistry}

> *class* max.pipelines.lib.registry.PipelineRegistry(architectures)

**Parameters:**

**architectures** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`SupportedArchitecture`](#max.pipelines.lib.registry.SupportedArchitecture) `]` )

### `get_active_huggingface_config()` {#max.pipelines.lib.registry.PipelineRegistry.get_active_huggingface_config}

> get\_active\_huggingface\_config(huggingface\_repo)

Retrieves or creates a cached HuggingFace AutoConfig for the given model configuration.

This method maintains a cache of HuggingFace configurations to avoid reloading them unnecessarily, which would incur a Hugging Face Hub API call. If a config for the given model hasn't been loaded before, it will create a new one using AutoConfig.from\_pretrained() with the model's settings.

**Parameters:**

**huggingface\_repo** ([`HuggingFaceRepo`](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo) ) – The HuggingFaceRepo containing the model.

**Returns:**

The HuggingFace configuration object for the model.

**Return type:**

AutoConfig

### `get_active_tokenizer()` {#max.pipelines.lib.registry.PipelineRegistry.get_active_tokenizer}

> get\_active\_tokenizer(huggingface\_repo)

Retrieves or creates a cached HuggingFace AutoTokenizer for the given model configuration.

This method maintains a cache of HuggingFace tokenizers to avoid reloading them unnecessarily, which would incur a Hugging Face Hub API call. If a tokenizer for the given model hasn't been loaded before, it will create a new one using AutoTokenizer.from\_pretrained() with the model's settings.

**Parameters:**

**huggingface\_repo** ([`HuggingFaceRepo`](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo) ) – The HuggingFaceRepo containing the model.

**Returns:**

The HuggingFace tokenizer for the model.

**Return type:**

PreTrainedTokenizer | PreTrainedTokenizerFast

### `register()` {#max.pipelines.lib.registry.PipelineRegistry.register}

> register(architecture, \*, allow\_override=False)

Add new architecture to registry.
**Parameters:** * **architecture** ([`SupportedArchitecture`](#max.pipelines.lib.registry.SupportedArchitecture) ) * **allow\_override** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** None ### `reset()` {#max.pipelines.lib.registry.PipelineRegistry.reset} > reset() **Return type:** None ### `retrieve()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve} > retrieve(pipeline\_config, task=PipelineTask.TEXT\_GENERATION, override\_architecture=None) **Parameters:** * **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) ) * **task** ([`PipelineTask`](core.md#max.pipelines.core.PipelineTask) ) * **override\_architecture** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) **Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[PipelineTokenizer](core.md#max.pipelines.core.PipelineTokenizer), PipelineTypes] ### `retrieve_architecture()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve_architecture} > retrieve\_architecture(huggingface\_repo) **Parameters:** **huggingface\_repo** ([`HuggingFaceRepo`](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo) ) **Return type:** [*SupportedArchitecture*](#max.pipelines.lib.registry.SupportedArchitecture) | None ### `retrieve_factory()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve_factory} > retrieve\_factory(pipeline\_config, task=PipelineTask.TEXT\_GENERATION, override\_architecture=None) **Parameters:** * **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) ) * **task** ([`PipelineTask`](core.md#max.pipelines.core.PipelineTask) ) * **override\_architecture** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` ) **Return type:** [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[PipelineTokenizer](core.md#max.pipelines.core.PipelineTokenizer), Callable\[\[], PipelineTypes]] ## `SupportedArchitecture` {#max.pipelines.lib.registry.SupportedArchitecture} > *class* max.pipelines.lib.registry.SupportedArchitecture(name, example\_repo\_ids, default\_encoding, supported\_encodings, pipeline\_model, task, tokenizer, default\_weights\_format, multi\_gpu\_supported=False, rope\_type=RopeType.none, weight\_adapters=None) Initializes a model architecture supported by MAX pipelines. New architectures should be registered into the [`PipelineRegistry`](#max.pipelines.lib.registry.PipelineRegistry). **Parameters:** * **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – Architecture name. * **example\_repo\_ids** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `]` ) – Hugging Face repo IDs that run this architecture. * **default\_encoding** (`SupportedEncoding` ) – Default encoding for the model. * **supported\_encodings** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` `SupportedEncoding` `,` [`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`KVCacheStrategy`](../nn/kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheStrategy) `]` `]` ) – Alternate encodings supported. * **pipeline\_model** ([`type`](https://docs.python.org/3/library/functions.html#type) `[` [`PipelineModel`](pipeline.md#max.pipelines.lib.pipeline.PipelineModel) `]` ) – `PipelineModel` class that defines the model graph and execution.
* **task** ([`PipelineTask`](core.md#max.pipelines.core.PipelineTask) ) – The pipeline task the model should run with. * **tokenizer** (`Callable` `[` `...` `,` [`PipelineTokenizer`](core.md#max.pipelines.core.PipelineTokenizer) `]` ) – Tokenizer used to preprocess model inputs. * **default\_weights\_format** (`WeightsFormat` ) – The weights format used in pipeline\_model. * **weight\_converters** – A dictionary of weight loaders to use if the input checkpoint has a different format than the default. * **multi\_gpu\_supported** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **rope\_type** (`RopeType` ) * **weight\_adapters** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` `WeightsFormat` `,` `WeightsAdapter` `]` `|` `None` ) ### `tokenizer_cls` {#max.pipelines.lib.registry.SupportedArchitecture.tokenizer_cls} > *property* tokenizer\_cls\*: [type](https://docs.python.org/3/library/functions.html#type)\[[PipelineTokenizer](core.md#max.pipelines.core.PipelineTokenizer)]\* ## `get_pipeline_for_task()` {#max.pipelines.lib.registry.get_pipeline_for_task} > max.pipelines.lib.registry.get\_pipeline\_for\_task(task, pipeline\_config) **Parameters:** * **task** ([`PipelineTask`](core.md#max.pipelines.core.PipelineTask) ) * **pipeline\_config** ([`PipelineConfig`](config.md#max.pipelines.lib.config.PipelineConfig) ) **Return type:** [type](https://docs.python.org/3/library/functions.html#type)\[[TextGenerationPipeline](pipeline.md#max.pipelines.lib.pipeline.TextGenerationPipeline)] | [type](https://docs.python.org/3/library/functions.html#type)\[EmbeddingsPipeline] | [type](https://docs.python.org/3/library/functions.html#type)\[SpeculativeDecodingTextGenerationPipeline] | [type](https://docs.python.org/3/library/functions.html#type)\[AudioGeneratorPipeline] --- ## relu `relu[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the ReLU op using the equation $max(0, x)$. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the ReLU operation on. **Returns:** The result of the ReLU operation. --- ## relu_n1 `relu_n1[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the ReLU N1 op using the equation $max(min(x,1),-1)$. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the ReLU N1 operation on. **Returns:** The result of the ReLU N1 operation. --- ## remainder `remainder[dtype: DType, width: Int, //](x: SIMD[dtype, width], y: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `remainder` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The first input argument. * ​y (`SIMD[dtype, width]`): The second input argument. **Returns:** The `remainder` of the inputs. --- ## remove `remove[PathLike: PathLike](path: PathLike)` Removes the specified file. If the path is a directory or it cannot be deleted, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from the cwd. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the `os.PathLike` trait.
**Args:** * ​path (`PathLike`): The path to the file. --- ## removedirs `removedirs[PathLike: PathLike](path: PathLike)` Removes a leaf directory and all empty intermediate ones. Directories corresponding to rightmost path segments will be pruned away until either the whole path is consumed or an error occurs. Errors during this latter phase are ignored, as they typically occur when a directory is not empty. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the `os.PathLike` trait. **Args:** * ​path (`PathLike`): The path to the directory. --- ## reorder_padding `reorder_padding[rank: Int](pad: DimList) -> DimList` --- ## repack_GPTQ_for_sm8x `repack_GPTQ_for_sm8x[in_layout: Layout, out_layout: Layout, scales_type: DType, group_size: Int, has_perm: Bool, *, perm_layout: Layout = Layout()](in_tensor: LayoutTensor[uint8, in_layout, MutableAnyOrigin], out_tensor: LayoutTensor[uint8, out_layout, MutableAnyOrigin], perm_idx: LayoutTensor[int32, perm_layout, MutableAnyOrigin])` --- ## repack_Q4_0_for_sm8x `repack_Q4_0_for_sm8x[q_layout: Layout, repack_layout: Layout, scales_type: DType](q_weight: LayoutTensor[uint8, q_layout, MutableAnyOrigin], q_packed_weight: LayoutTensor[uint8, repack_layout, MutableAnyOrigin])` --- ## repeat_interleave ## Functions * [​`repeat_interleave`](./repeat_interleave): Fill `output` by repeating values from `input` along `axis` based on the values in `repeats` buffer. * [​`repeat_interleave_shape`](./repeat_interleave_shape): --- ## repeat_interleave `repeat_interleave[type: DType, rank: Int, type_repeats: DType](input: NDBuffer[type, rank, origin], repeats: NDBuffer[type_repeats, 1, origin], axis: Int, output: NDBuffer[type, rank, origin])` Fill `output` by repeating values from `input` along `axis` based on the values in `repeats` buffer. This is intended to implement the same functionality as `torch.repeat_interleave`. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input buffer. * ​repeats (`NDBuffer[type_repeats, 1, origin]`): The number of repetitions for each element in input. * ​axis (`Int`): The axis along which to repeat values. * ​output (`NDBuffer[type, rank, origin]`): The output buffer. --- ## repeat_interleave_shape `repeat_interleave_shape[type_repeats: DType](input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], repeats: NDBuffer[type_repeats, 1, origin], axis: Int) -> IndexList[rank]` --- ## Report `struct Report` Contains the average execution time, iterations, min and max of each batch. ## Fields * ​warmup\_duration (`Int`): The total duration it took to warmup. * ​runs (`List[Batch]`): A `List` of benchmark runs. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Default initializer for the Report. Sets all values to 0. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * ​other (`Self`): The value to copy. ### `__copyinit__` `__copyinit__(out self, existing: Self)` Creates a shallow copy (it doesn't copy the data). **Args:** * ​existing (`Self`): The `Report` to copy. ### `iters` `iters(self) -> Int` The total benchmark iterations. **Returns:** The total benchmark iterations. ### `duration` `duration(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The total duration it took to run all benchmarks. **Args:** * ​unit (`String`): The time unit to display, for example: `ns`, `ms`, or `s` (default `s`).
**Returns:** The total duration it took to run all benchmarks. ### `mean` `mean(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The average duration of all benchmark runs. **Args:** * ​unit (`String`): The time unit to display, for example: `ns`, `ms`, or `s` (default `s`). **Returns:** The average duration of all benchmark runs. ### `min` `min(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The batch of benchmarks that was the fastest to run. **Args:** * ​unit (`String`): The time unit to display, for example: `ns`, `ms`, or `s` (default `s`). **Returns:** The fastest duration out of all batches. ### `max` `max(self, unit: String = __init__[__mlir_type.!kgen.string]("s")) -> SIMD[float64, 1]` The batch of benchmarks that was the slowest to run. **Args:** * ​unit (`String`): The time unit to display, for example: `ns`, `ms`, or `s` (default `s`). **Returns:** The slowest duration out of all batches. ### `print` `print(self, unit: String = __init__[__mlir_type.!kgen.string]("s"))` Prints out the shortened version of the report. **Args:** * ​unit (`String`): The time unit to display, for example: `ns`, `ms`, or `s` (default `s`). ### `print_full` `print_full(self, unit: String = __init__[__mlir_type.!kgen.string]("s"))` Prints out the full version of the report with each batch of benchmark runs. **Args:** * ​unit (`String`): The time unit to display, for example: `ns`, `ms`, or `s` (default `s`). --- ## repr Provide the `repr` function. The functions and traits provided here are built-ins, so you don't need to import them. ## Traits * [​`Representable`](/mojo/stdlib/builtin/repr/Representable): A trait that describes a type that has a String representation. ## Functions * [​`repr`](/mojo/stdlib/builtin/repr/repr): Returns the string representation of the given value. --- ## repr `repr[T: Representable](value: T) -> String` Returns the string representation of the given value. **Parameters:** * ​T (`Representable`): The type of `value`. Must implement the `Representable` trait. **Args:** * ​value (`T`): The value to get the string representation of. **Returns:** The string representation of the given value. `repr(value: None) -> String` Returns the string representation of `None`. **Args:** * ​value (`None`): A `None` value. **Returns:** The string representation of `None`. --- ## Representable A trait that describes a type that has a String representation. Any type that conforms to the `Representable` trait can be used with the `repr` function. Any conforming type must also implement the `__repr__` method. Here is an example: ```mojo struct Dog(Representable): var name: String var age: Int fn __repr__(self) -> String: return "Dog(name=" + repr(self.name) + ", age=" + repr(self.age) + ")" var dog = Dog("Rex", 5) print(repr(dog)) # Dog(name='Rex', age=5) ``` The method `__repr__` should compute the "official" string representation of a type. If at all possible, this should look like a valid Mojo expression that could be used to recreate a struct instance with the same value (given an appropriate environment). So a returned String of the form `module_name.SomeStruct(arg1=value1, arg2=value2)` is advised. If this is not possible, a string of the form `<...some useful description...>` should be returned. The return value must be a `String` instance. This is typically used for debugging, so it is important that the representation is information-rich and unambiguous.
Note that when computing the string representation of a collection (`Dict`, `List`, `Set`, etc...), the `repr` function is called on each element, not the `String()` function. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__repr__` `__repr__(self: _Self) -> String` Get the string representation of the type instance, if possible, compatible with Mojo syntax. **Returns:** The string representation of the instance. --- ## reshape ## Functions * [​`ndbuffer_reshape`](./ndbuffer_reshape): * [​`reshape`](./reshape): * [​`reshape_shape`](./reshape_shape): --- ## reshape `reshape[rank: Int, type: DType, //, output_rank: Int, single_thread_blocking_override: Bool = True](input: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], new_shape: IndexList[output_rank]) -> NDBuffer[type, output_rank, origin]` --- ## reshape_shape `reshape_shape[input_rank: Int, output_rank: Int, input_type: DType, target_shape_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], target_shape_buf: NDBuffer[target_shape_type, 1, origin]) -> IndexList[output_rank]` --- ## resize ## Structs * [​`CoordinateTransformationMode`](./CoordinateTransformationMode): * [​`InterpolationMode`](./InterpolationMode): * [​`Interpolator`](./Interpolator): * [​`RoundMode`](./RoundMode): ## Functions * [​`coord_transform`](./coord_transform): * [​`interpolate_point_1d`](./interpolate_point_1d): * [​`linear_filter`](./linear_filter): This is a tent filter (see the sketch below). * [​`resize_linear`](./resize_linear): Resizes input to output shape using linear interpolation. * [​`resize_nearest_neighbor`](./resize_nearest_neighbor): --- ## resize_linear `resize_linear[coordinate_transformation_mode: CoordinateTransformationMode, antialias: Bool, rank: Int, type: DType](input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])` Resizes input to output shape using linear interpolation. Does not use an anti-aliasing filter for downsampling (coming soon). **Parameters:** * ​coordinate\_transformation\_mode (`CoordinateTransformationMode`): How to map a coordinate in output to a coordinate in input. * ​antialias (`Bool`): Whether or not to use an antialiasing linear/cubic filter which, when downsampling, uses more points to avoid aliasing artifacts. Effectively stretches the filter by a factor of 1 / scale. * ​rank (`Int`): Rank of the input and output. * ​type (`DType`): Type of input and output. **Args:** * ​input (`NDBuffer[type, rank, origin]`): The input to be resized. * ​output (`NDBuffer[type, rank, origin]`): The output containing the resized input.
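As a companion to the `linear_filter` entry above (described as a tent filter), here is a minimal sketch of the 1-D tent weighting that underlies linear resizing. The function name `tent_weight` is our own; the real kernel operates on SIMD vectors and folds in the scale factor.

```mojo
fn tent_weight(x: Float64) -> Float64:
    # Triangle ("tent") filter: the weight decays linearly with distance
    # from the sample point and is zero at distance >= 1.
    var d = abs(x)
    return 1.0 - d if d < 1.0 else 0.0

fn main():
    print(tent_weight(0.0))   # 1.0 (exactly on the sample point)
    print(tent_weight(0.25))  # 0.75
    print(tent_weight(1.5))   # 0.0 (outside the filter support)
```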
--- ## resize_nearest_neighbor `resize_nearest_neighbor[coordinate_transformation_mode: CoordinateTransformationMode, round_mode: RoundMode, rank: Int, type: DType](input: NDBuffer[type, rank, origin], output: NDBuffer[type, rank, origin])` --- ## Result `@register_passable(trivial)` `struct Result` ## Fields * ​code (`SIMD[int32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility` ## Aliases ### `ALREADY_INITIALIZED` `alias ALREADY_INITIALIZED = Result(__init__[__mlir_type.!pop.int_literal](5))` Deprecated: Multiple initializations are now allowed through ref counting ### `ARGUMENT_VERSION_MISMATCH` `alias ARGUMENT_VERSION_MISMATCH = Result(__init__[__mlir_type.!pop.int_literal](25))` The provided version is invalid/unsupported ### `CORRUPTED_INFOROM` `alias CORRUPTED_INFOROM = Result(__init__[__mlir_type.!pop.int_literal](14))` infoROM is corrupted ### `DEPRECATED` `alias DEPRECATED = Result(__init__[__mlir_type.!pop.int_literal](26))` The requested functionality has been deprecated ### `DRIVER_NOT_LOADED` `alias DRIVER_NOT_LOADED = Result(__init__[__mlir_type.!pop.int_literal](9))` NVIDIA driver is not loaded ### `FREQ_NOT_SUPPORTED` `alias FREQ_NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](24))` Ran out of critical resources, other than memory ### `FUNCTION_NOT_FOUND` `alias FUNCTION_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](13))` Local version of NVML doesn't implement this function ### `GPU_IS_LOST` `alias GPU_IS_LOST = Result(__init__[__mlir_type.!pop.int_literal](15))` The GPU has fallen off the bus or has otherwise become inaccessible ### `GPU_NOT_FOUND` `alias GPU_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](28))` No GPUs were found ### `IN_USE` `alias IN_USE = Result(__init__[__mlir_type.!pop.int_literal](19))` An operation cannot be performed because the GPU is currently in use ### `INSUFFICIENT_POWER` `alias INSUFFICIENT_POWER = Result(__init__[__mlir_type.!pop.int_literal](8))` A device's external power cables are not properly attached ### `INSUFFICIENT_RESOURCES` `alias INSUFFICIENT_RESOURCES = Result(__init__[__mlir_type.!pop.int_literal](23))` Ran out of critical resources, other than memory ### `INSUFFICIENT_SIZE` `alias INSUFFICIENT_SIZE = Result(__init__[__mlir_type.!pop.int_literal](7))` An input argument is not large enough ### `INVALID_ARGUMENT` `alias INVALID_ARGUMENT = Result(__init__[__mlir_type.!pop.int_literal](2))` A supplied argument is invalid ### `IRQ_ISSUE` `alias IRQ_ISSUE = Result(__init__[__mlir_type.!pop.int_literal](11))` NVIDIA Kernel detected an interrupt issue with a GPU ### `LIB_RM_VERSION_MISMATCH` `alias LIB_RM_VERSION_MISMATCH = Result(__init__[__mlir_type.!pop.int_literal](18))` RM detects a driver/library version mismatch ### `LIBRARY_NOT_FOUND` `alias LIBRARY_NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](12))` NVML Shared Library couldn't be found or loaded ### `MEMORY` `alias MEMORY = Result(__init__[__mlir_type.!pop.int_literal](20))` Insufficient memory ### `NO_DATA` `alias NO_DATA = Result(__init__[__mlir_type.!pop.int_literal](21))` No data ### `NO_PERMISSION` `alias NO_PERMISSION = Result(__init__[__mlir_type.!pop.int_literal](4))` The current user does not have permission for operation ### `NOT_FOUND` `alias NOT_FOUND = Result(__init__[__mlir_type.!pop.int_literal](6))` A query to find an object was unsuccessful ### `NOT_READY` `alias NOT_READY = 
Result(__init__[__mlir_type.!pop.int_literal](27))` The system is not ready for the request ### `NOT_SUPPORTED` `alias NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](3))` The requested operation is not available on target device ### `OPERATING_SYSTEM` `alias OPERATING_SYSTEM = Result(__init__[__mlir_type.!pop.int_literal](17))` The GPU control device has been blocked by the operating system/cgroups ### `RESET_REQUIRED` `alias RESET_REQUIRED = Result(__init__[__mlir_type.!pop.int_literal](16))` The GPU requires a reset before it can be used again ### `SUCCESS` `alias SUCCESS = Result(__init__[__mlir_type.!pop.int_literal](0))` The operation was successful ### `TIMEOUT` `alias TIMEOUT = Result(__init__[__mlir_type.!pop.int_literal](10))` User provided timeout passed ### `UNINITIALIZED` `alias UNINITIALIZED = Result(__init__[__mlir_type.!pop.int_literal](1))` NVML was not first initialized with nvmlInit() ### `UNKNOWN` `alias UNKNOWN = Result(__init__[__mlir_type.!pop.int_literal](999))` An internal driver error occurred ### `VGPU_ECC_NOT_SUPPORTED` `alias VGPU_ECC_NOT_SUPPORTED = Result(__init__[__mlir_type.!pop.int_literal](22))` The requested vgpu operation is not available on target device, because ECC is enabled ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` ### `__ne__` `__ne__(self, other: Self) -> Bool` ### `__str__` `__str__(self) -> String` --- ## reverse `reverse(src: IntTuple[origin]) -> IntTuple` Reverses the order of elements in an `IntTuple`, recursively. This function reverses the top-level elements of the `IntTuple` and recursively reverses any nested `IntTuple`s. Example: ```mojo from layout.int_tuple import IntTuple, reverse var t = IntTuple(1, 2, IntTuple(3, 4)) var reversed = reverse(t) # returns ((4, 3), 2, 1) ``` **Args:** * ​src (`IntTuple[origin]`): The source `IntTuple` to reverse. **Returns:** A new `IntTuple` with elements in reversed order. --- ## reverse_idx `reverse_idx[transpose: Bool](x: Int, y: Int) -> IndexList[2]` --- ## reversed Provides the `reversed` function for reverse iteration over collections. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`ReversibleRange`](/mojo/stdlib/builtin/reversed/ReversibleRange): The `ReversibleRange` trait describes a range that can be reversed. ## Functions * [​`reversed`](/mojo/stdlib/builtin/reversed/reversed): Get a reversed iterator of the input range (a usage sketch appears after this section). --- ## reversed `reversed[T: ReversibleRange](value: T) -> _StridedRange` Get a reversed iterator of the input range. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`ReversibleRange`): The type conforming to ReversibleRange. **Args:** * ​value (`T`): The range to get the reversed iterator of. **Returns:** The reversed iterator of the range. `reversed[T: Copyable & Movable](ref value: List[T, hint_trivial_type]) -> _ListIter[T, hint_trivial_type, value_is_origin, False]` Get a reversed iterator of the input list. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the list. **Args:** * ​value (`List[T, hint_trivial_type]`): The list to get the reversed iterator of. **Returns:** The reversed iterator of the list. `reversed[T: Copyable & Movable](ref value: Deque[T]) -> _DequeIter[T, value_is_origin, False]` Get a reversed iterator of the deque. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the deque.
**Args:** * ​value (`Deque[T]`): The deque to get the reversed iterator of. **Returns:** The reversed iterator of the deque. `reversed[K: KeyElement, V: Copyable & Movable](ref value: Dict[K, V]) -> _DictKeyIter[K, V, value_is_origin, False]` Get a reversed iterator of the input dict. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`KeyElement`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict. **Args:** * ​value (`Dict[K, V]`): The dict to get the reversed iterator of. **Returns:** The reversed iterator of the dict keys. `reversed[K: KeyElement, V: Copyable & Movable, dict_mutability: Bool, dict_origin: Origin[dict_mutability]](ref value: _DictValueIter[K, V, dict_origin]) -> _DictValueIter[K, V, dict_origin, False]` Get a reversed iterator of the input dict values. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`KeyElement`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict. * ​dict\_mutability (`Bool`): Whether the reference to the dict values is mutable. * ​dict\_origin (`Origin[dict_mutability]`): The origin of the dict values. **Args:** * ​value (`_DictValueIter[K, V, dict_origin]`): The dict values to get the reversed iterator of. **Returns:** The reversed iterator of the dict values. `reversed[K: KeyElement, V: Copyable & Movable, dict_mutability: Bool, dict_origin: Origin[dict_mutability]](ref value: _DictEntryIter[K, V, dict_origin]) -> _DictEntryIter[K, V, dict_origin, False]` Get a reversed iterator of the input dict items. **Note**: iterators are currently non-raising. **Parameters:** * ​K (`KeyElement`): The type of the keys in the dict. * ​V (`Copyable & Movable`): The type of the values in the dict. * ​dict\_mutability (`Bool`): Whether the reference to the dict items is mutable. * ​dict\_origin (`Origin[dict_mutability]`): The origin of the dict items. **Args:** * ​value (`_DictEntryIter[K, V, dict_origin]`): The dict items to get the reversed iterator of. **Returns:** The reversed iterator of the dict items. `reversed[T: Copyable & Movable](value: Span[T, origin]) -> _SpanIter[T, origin, False]` Get a reversed iterator of the input Span. **Note**: iterators are currently non-raising. **Parameters:** * ​T (`Copyable & Movable`): The type of the elements in the Span. **Args:** * ​value (`Span[T, origin]`): The Span to get the reversed iterator of. **Returns:** The reversed iterator of the Span. --- ## ReversibleRange The `ReversibleRange` trait describes a range that can be reversed. Any type that conforms to `ReversibleRange` works with the builtin [`reversed()`](/mojo/stdlib/builtin/reversed.html) function. The `ReversibleRange` trait requires the type to define the `__reversed__()` method. **Note**: iterators are currently non-raising. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__reversed__` `__reversed__(self: _Self) -> _StridedRange` Get a reversed iterator for the type. **Note**: iterators are currently non-raising. **Returns:** The reversed iterator of the type. --- ## right_inverse `right_inverse(layout: Layout) -> Layout` Creates a right inverse of a layout. The right inverse of a layout maps memory indices back to logical coordinates. This is useful for converting between different memory layouts. **Args:** * ​layout (`Layout`): The layout to invert. **Returns:** A new layout representing the right inverse of the input layout.
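Tying the `reversed` overloads and the `ReversibleRange` trait above together, here is a minimal usage sketch. It assumes a recent Mojo toolchain in which loop variables from list iteration can be printed directly; treat it as illustrative rather than canonical.

```mojo
fn main():
    # Reverse-iterate a range (`range` conforms to ReversibleRange).
    for i in reversed(range(3)):
        print(i)  # prints 2, 1, 0

    # Reverse-iterate a List.
    var values = List[Int](10, 20, 30)
    for v in reversed(values):
        print(v)  # prints 30, 20, 10
```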
--- ## rmdir `rmdir[PathLike: PathLike](path: PathLike)` Removes the specified directory. If the path is not a directory or it cannot be deleted, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from the cwd. **Parameters:** * ​PathLike (`PathLike`): A type conforming to the `os.PathLike` trait. **Args:** * ​path (`PathLike`): The path to the directory. --- ## rms_norm Normalization layer. ## `DistributedRMSNorm` {#max.nn.norm.rms_norm.DistributedRMSNorm} > *class* max.nn.norm.rms\_norm.DistributedRMSNorm(\*args, devices, \*\*kwargs) **Parameters:** **devices** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `DeviceRef` `]` ) ## `RMSNorm` {#max.nn.norm.rms_norm.RMSNorm} > *class* max.nn.norm.rms\_norm.RMSNorm(dim, dtype, eps=1e-06, weight\_offset=0.0, multiply\_before\_cast=True) Computes the Root Mean Square normalization on inputs. **Parameters:** * **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – Size of the last dimension of the expected input. * **eps** ([`float`](https://docs.python.org/3/library/functions.html#float) ) – Value added to the denominator for numerical stability. * **weight\_offset** ([`float`](https://docs.python.org/3/library/functions.html#float) ) – Constant offset added to the learned weights at runtime. For Gemma-style RMSNorm, this should be set to 1.0. * **multiply\_before\_cast** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) – True if we multiply the inputs by the learned weights before casting to the input type (Gemma3-style). False if we cast the inputs to the input type first, then multiply by the learned weights (Llama-style). * **dtype** ([`DType`](../../dtype.md#max.dtype.DType) ) ## `RMSNormV1` {#max.nn.norm.rms_norm.RMSNormV1} > *class* max.nn.norm.rms\_norm.RMSNormV1(weight, eps=1e-06, weight\_offset=0.0, multiply\_before\_cast=True) Computes the Root Mean Square normalization on inputs. Deprecated: Use RMSNorm instead.
**Parameters:** * **weight** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ) * **eps** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **weight\_offset** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **multiply\_before\_cast** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) ### `eps` {#max.nn.norm.rms_norm.RMSNormV1.eps} > eps\*: [float](https://docs.python.org/3/library/functions.html#float)\* *= 1e-06* ### `multiply_before_cast` {#max.nn.norm.rms_norm.RMSNormV1.multiply_before_cast} > multiply\_before\_cast\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= True* ### `weight` {#max.nn.norm.rms_norm.RMSNormV1.weight} > weight\*: Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/type.md#max.graph.type.Shape) | [Dim](../../graph/type.md#max.graph.type.Dim) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) | [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\* ### `weight_offset` {#max.nn.norm.rms_norm.RMSNormV1.weight_offset} > weight\_offset\*: [float](https://docs.python.org/3/library/functions.html#float)\* *= 0.0* --- ## rms_norm `rms_norm[type: DType, rank: Int, input_0_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], /, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), multiply_before_cast: Bool = True](shape: IndexList[rank], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], output: NDBuffer[type, rank, origin], ctx: DeviceContextPtr)` --- ## rms_norm_cpu `rms_norm_cpu[type: DType, //, input_fn: fn[Int](Int, Int) capturing -> SIMD[type, $0], output_fn: fn[Int](Int, Int, SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], out_shape: IndexList[2])` `rms_norm_cpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int](IndexList[rank], SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](shape: IndexList[rank], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1])` --- ## rms_norm_gpu `rms_norm_gpu[type: DType, rank: Int, //, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int](IndexList[rank], SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](shape: IndexList[rank, element_type=element_type], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], ctx: DeviceContext)` --- ## rms_norm_gpu_block `rms_norm_gpu_block[type: DType, //, 
simd_width: Int, max_warps_per_block: Int, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], output_fn: fn[Int](row: Int, col: Int, val: SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], num_cols: Int)` --- ## rms_norm_gpu_warp_tiling `rms_norm_gpu_warp_tiling[type: DType, //, simd_width: Int, max_warps_per_block: Int, input_fn: fn[Int](row: Int, col: Int) capturing -> SIMD[type, $0], output_fn: fn[Int](row: Int, col: Int, val: SIMD[type, $0]) capturing -> None, multiply_before_cast: Bool](gamma: NDBuffer[type, 1, MutableAnyOrigin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], num_cols: Int)` --- ## rms_norm_kv_cache_ragged_continuous_batching `rms_norm_kv_cache_ragged_continuous_batching[type: DType, num_heads: Int, head_dim: Int, //, target: StringSlice[StaticConstantOrigin], multiply_before_cast: Bool](kv_collection: ContinuousBatchingKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim))], gamma: NDBuffer[type, 1, origin, shape, strides], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], layer_idx: SIMD[uint32, 1], total_seq_len: SIMD[uint32, 1], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], context: DeviceContextPtr)` Performs RMSNorm in place on new entries in the key cache. This is done by first creating a ragged tensor of shape (total\_seq\_len, num\_heads, head\_dim) over the new token tensor. To do this, we need to pass in `total_seq_len` on the host. Then, using `input_row_offsets`, we find the corresponding batch and token index, and use that together with the static head and channel indices to store to/load from the key cache. This uses the input/output lambdas on the RMSNorm kernel. This function could apply RMSNorm to a subset of dimensions in each head, determined by the size of the gamma tensor. In this case, it operates on a ragged tensor view of the key cache with shape (total\_seq\_len, num\_heads, rms\_norm\_cols), where rms\_norm\_cols is the length of gamma and must be --- ## rms_norm_kv_cache_ragged_paged `rms_norm_kv_cache_ragged_paged[type: DType, num_heads: Int, head_dim: Int, //, target: StringSlice[StaticConstantOrigin], multiply_before_cast: Bool](kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], gamma: NDBuffer[type, 1, origin, shape, strides], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1], layer_idx: SIMD[uint32, 1], total_seq_len: SIMD[uint32, 1], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], context: DeviceContextPtr)` Performs RMSNorm in place on new entries in the key cache. This is done by first creating a ragged tensor of shape (total\_seq\_len, num\_heads, head\_dim) over the new token tensor. To do this, we need to pass in `total_seq_len` on the host. Then, using `input_row_offsets`, we find the corresponding batch and token index, and use that together with the static head and channel indices to store to/load from the key cache. This uses the input/output lambdas on the RMSNorm kernel. This function could apply RMSNorm to a subset of dimensions in each head, determined by the size of the gamma tensor.
In this case, it operates on a ragged tensor view of the key cache with shape (total\_seq\_len, num\_heads, rms\_norm\_cols), where rms\_norm\_cols is the length of gamma and must be --- ## rms_norm_shape `rms_norm_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], gamma: NDBuffer[type, 1, origin], epsilon: SIMD[type, 1], weight_offset: SIMD[type, 1]) -> IndexList[rank]` --- ## roi_align ## Structs * [​`Weighted2DPoint`](./Weighted2DPoint): Utility class to wrap 2-d point coordinates and floating point weight for bilinear interpolation. ## Functions * [​`roi_align_nhwc`](./roi_align_nhwc): Computes ROIAlign over a batch of ROIs of shape \[M, 5], where the first dim is the batch index, followed by region box coordinates (y0, x0), (y1, x1). For inputs in NHWC format. The output shape is \[M, output\_height, output\_width, C]. --- ## roi_align_nhwc `roi_align_nhwc[type: DType, output_layout: Layout, input_layout: Layout, roi_layout: Layout, //, aligned: Bool, mode: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("AVG")](output: LayoutTensor[type, output_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], input: LayoutTensor[type, input_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], rois: LayoutTensor[type, roi_layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output_height: Int, output_width: Int, in_spatial_scale: SIMD[dtype, 1], in_sampling_ratio: SIMD[dtype, 1])` Computes ROIAlign over a batch of ROIs of shape \[M, 5], where the first dim is the batch index, followed by region box coordinates (y0, x0), (y1, x1). For inputs in NHWC format. The output shape is \[M, output\_height, output\_width, C]. **Parameters:** * ​type (`DType`): Type of the input tensor. * ​output\_layout (`Layout`): The output layout. * ​input\_layout (`Layout`): The input layout. * ​roi\_layout (`Layout`): The layout of the regions of interest (ROI). * ​aligned (`Bool`): If not true, offset the ROIs by 0.5. * ​mode (`StringSlice[StaticConstantOrigin]`): The pooling mode: "AVG" for average and "MAX" for max pooling. --- ## rope_k_cache `rope_k_cache[type: DType, cache_t: KVCacheT, width: Int, //, *, interleaved: Bool](k_cache: cache_t, b_idx: Int, h_idx: Int, s_idx: Int, d_idx: Int, freq_val: SIMD[type, width], head_size: Int)` --- ## rope_q_proj `rope_q_proj[type: DType, rank: Int, width: Int, //, *, interleaved: Bool](q_proj: NDBuffer[type, rank, origin, shape, strides], output: NDBuffer[type, rank, origin, shape, strides], idx: IndexList[rank], freq_val: SIMD[type, width], head_size: Int)` --- ## rotary_embedding The rope embedding used within the model.
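As background for the embedding classes below: RoPE rotates each adjacent pair of feature channels by a position-dependent angle. Assuming the standard RoFormer formulation (stated here as context, not quoted from this API's source), for token position $m$, rotary dimension $d$, and base `theta`:

$$
\theta_i = \texttt{theta}^{-2i/d}, \qquad
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\quad i = 0, \ldots, d/2 - 1.
$$

The `freqs_cis` tensors computed by these classes precompute the cosine/sine pairs for every position, which is why their trailing dimensions are $(\ldots, d/2, 2)$.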
## `DeepseekYarnRopeScalingParams` {#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams} > *class* max.nn.rotary\_embedding.DeepseekYarnRopeScalingParams(scaling\_factor: [float](https://docs.python.org/3/library/functions.html#float), original\_max\_position\_embeddings: [int](https://docs.python.org/3/library/functions.html#int), beta\_fast: [int](https://docs.python.org/3/library/functions.html#int), beta\_slow: [int](https://docs.python.org/3/library/functions.html#int), mscale: [float](https://docs.python.org/3/library/functions.html#float), mscale\_all\_dim: [float](https://docs.python.org/3/library/functions.html#float)) **Parameters:** * **scaling\_factor** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **original\_max\_position\_embeddings** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **beta\_fast** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **beta\_slow** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **mscale** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **mscale\_all\_dim** ([`float`](https://docs.python.org/3/library/functions.html#float) ) ### `beta_fast` {#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams.beta_fast} > beta\_fast\*: [int](https://docs.python.org/3/library/functions.html#int)\* Fast interpolation rate. ### `beta_slow` {#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams.beta_slow} > beta\_slow\*: [int](https://docs.python.org/3/library/functions.html#int)\* Slow interpolation rate. ### `mscale` {#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams.mscale} > mscale\*: [float](https://docs.python.org/3/library/functions.html#float)\* Scaling factor for middle frequencies. ### `mscale_all_dim` {#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams.mscale_all_dim} > mscale\_all\_dim\*: [float](https://docs.python.org/3/library/functions.html#float)\* Scaling factor applied to all dimensions. ### `original_max_position_embeddings` {#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams.original_max_position_embeddings} > original\_max\_position\_embeddings\*: [int](https://docs.python.org/3/library/functions.html#int)\* Original maximum sequence length during training. ### `scaling_factor` {#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams.scaling_factor} > scaling\_factor\*: [float](https://docs.python.org/3/library/functions.html#float)\* Scaling factor for frequency interpolation. ## `DeepseekYarnRotaryEmbedding` {#max.nn.rotary_embedding.DeepseekYarnRotaryEmbedding} > *class* max.nn.rotary\_embedding.DeepseekYarnRotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, device, head\_dim=None, \_freqs\_cis=None, interleaved=True, scaling\_params=None) Deepseek’s YaRN (Yet another RoPE eNhancement) Rotary Position Embedding layer. Unlike Llama3RotaryEmbedding, the dim argument here is the rope dimension of the model, not the hidden dimension. 
**Parameters:** * **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **theta** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **device** (`DeviceRef` ) * **head\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **\_freqs\_cis** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) * **interleaved** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **scaling\_params** ([`DeepseekYarnRopeScalingParams`](#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams) `|` `None` ) ### `compute_scale()` {#max.nn.rotary_embedding.DeepseekYarnRotaryEmbedding.compute_scale} > compute\_scale(user\_scale=None) **Parameters:** **user\_scale** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) **Return type:** [float](https://docs.python.org/3/library/functions.html#float) ### `freqs_cis_base()` {#max.nn.rotary_embedding.DeepseekYarnRotaryEmbedding.freqs_cis_base} > freqs\_cis\_base() Computes the frequency tensor for complex exponentials (cis) for a given seq\_len. The tensor is scaled with the theta parameter. Required to apply Rotary Position Embedding (RoPE) to a tensor. See ‘RoFormer: Enhanced Transformer with Rotary Position Embedding’ (arxiv.org/pdf/2104.09864). **Returns:** The frequency tensor for complex exponentials with shape (max\_seq\_len, rope\_dim // 2, 2) **Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue) ### `scaling_params` {#max.nn.rotary_embedding.DeepseekYarnRotaryEmbedding.scaling_params} > scaling\_params\*: [DeepseekYarnRopeScalingParams](#max.nn.rotary_embedding.DeepseekYarnRopeScalingParams) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* ## `LinearScalingParams` {#max.nn.rotary_embedding.LinearScalingParams} > *class* max.nn.rotary\_embedding.LinearScalingParams(factor: [float](https://docs.python.org/3/library/functions.html#float)) **Parameters:** **factor** ([`float`](https://docs.python.org/3/library/functions.html#float) ) ### `factor` {#max.nn.rotary_embedding.LinearScalingParams.factor} > factor\*: [float](https://docs.python.org/3/library/functions.html#float)\* Main scaling factor for the frequency components of the rope.
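For context on `LinearScalingParams.factor`: under the common position-interpolation ("linear") RoPE scaling convention (an assumption about the scheme, not quoted from this API's source), positions are compressed by the factor, which is equivalent to dividing each rotary frequency by it:

$$
m' = \frac{m}{\text{factor}} \quad\Longleftrightarrow\quad \theta_i' = \frac{\theta_i}{\text{factor}}.
$$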
## `Llama3RopeScalingParams` {#max.nn.rotary_embedding.Llama3RopeScalingParams} > *class* max.nn.rotary\_embedding.Llama3RopeScalingParams(factor: [float](https://docs.python.org/3/library/functions.html#float), low\_freq\_factor: [float](https://docs.python.org/3/library/functions.html#float), high\_freq\_factor: [float](https://docs.python.org/3/library/functions.html#float), orig\_max\_position: [int](https://docs.python.org/3/library/functions.html#int)) **Parameters:** * **factor** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **low\_freq\_factor** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **high\_freq\_factor** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **orig\_max\_position** ([`int`](https://docs.python.org/3/library/functions.html#int) ) ### `factor` {#max.nn.rotary_embedding.Llama3RopeScalingParams.factor} > factor\*: [float](https://docs.python.org/3/library/functions.html#float)\* Main scaling factor for the frequency components of the rope. ### `high_freq_factor` {#max.nn.rotary_embedding.Llama3RopeScalingParams.high_freq_factor} > high\_freq\_factor\*: [float](https://docs.python.org/3/library/functions.html#float)\* Factor to scale the high frequency components of the rope. ### `low_freq_factor` {#max.nn.rotary_embedding.Llama3RopeScalingParams.low_freq_factor} > low\_freq\_factor\*: [float](https://docs.python.org/3/library/functions.html#float)\* Factor to scale the low frequency components of the rope. ### `orig_max_position` {#max.nn.rotary_embedding.Llama3RopeScalingParams.orig_max_position} > orig\_max\_position\*: [int](https://docs.python.org/3/library/functions.html#int)\* The original maximum position length supported by the model. ## `Llama3RotaryEmbedding` {#max.nn.rotary_embedding.Llama3RotaryEmbedding} > *class* max.nn.rotary\_embedding.Llama3RotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, device, head\_dim=None, \_freqs\_cis=None, interleaved=True, scaling\_params=None) RotaryEmbedding for Llama3 that takes rope scaling into account. 
**Parameters:** * **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **theta** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **device** (`DeviceRef` ) * **head\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **\_freqs\_cis** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) * **interleaved** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) * **scaling\_params** ([`Llama3RopeScalingParams`](#max.nn.rotary_embedding.Llama3RopeScalingParams) `|` `None` ) ### `scaling_params` {#max.nn.rotary_embedding.Llama3RotaryEmbedding.scaling_params} > scaling\_params\*: [Llama3RopeScalingParams](#max.nn.rotary_embedding.Llama3RopeScalingParams) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* Scaling parameters to enable llama to function with a longer context length. ## `OptimizedRotaryEmbedding` {#max.nn.rotary_embedding.OptimizedRotaryEmbedding} > *class* max.nn.rotary\_embedding.OptimizedRotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, device, head\_dim=None, \_freqs\_cis=None, interleaved=True) Optimized version of RotaryEmbedding using 2D frequency tensor representation. **Parameters:** * **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **theta** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **device** (`DeviceRef` ) * **head\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **\_freqs\_cis** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) * **interleaved** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) ### `freqs_cis` {#max.nn.rotary_embedding.OptimizedRotaryEmbedding.freqs_cis} > *property* freqs\_cis ## `RotaryEmbedding` {#max.nn.rotary_embedding.RotaryEmbedding} > *class* max.nn.rotary\_embedding.RotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, device, head\_dim=None, \_freqs\_cis=None, interleaved=True) RotaryEmbedding layer to calculate and apply the frequency tensor for complex exponentials. 
**Parameters:** * **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **theta** ([`float`](https://docs.python.org/3/library/functions.html#float) ) * **max\_seq\_len** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **device** (`DeviceRef` ) * **head\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` ) * **\_freqs\_cis** (`Value` `[` `TensorType` `]` `|` [`TensorValue`](../graph/TensorValue.md#max.graph.TensorValue) `|` [`Shape`](../graph/type.md#max.graph.type.Shape) `|` [`Dim`](../graph/type.md#max.graph.type.Dim) `|` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`float`](https://docs.python.org/3/library/functions.html#float) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `|` [`floating`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating) `|` [`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) `|` `None` ) * **interleaved** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) ### `compute_scale()` {#max.nn.rotary_embedding.RotaryEmbedding.compute_scale} > compute\_scale(user\_scale=None) **Parameters:** **user\_scale** ([`float`](https://docs.python.org/3/library/functions.html#float) `|` `None` ) **Return type:** [float](https://docs.python.org/3/library/functions.html#float) ### `device` {#max.nn.rotary_embedding.RotaryEmbedding.device} > device\*: DeviceRef\* ### `dim` {#max.nn.rotary_embedding.RotaryEmbedding.dim} > dim\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `freqs_cis` {#max.nn.rotary_embedding.RotaryEmbedding.freqs_cis} > *property* freqs\_cis\*: [TensorValue](../graph/TensorValue.md#max.graph.TensorValue)\* ### `freqs_cis_base()` {#max.nn.rotary_embedding.RotaryEmbedding.freqs_cis_base} > freqs\_cis\_base() Computes the frequency tensor for complex exponentials (cis) for a given seq\_len. The tensor is scaled with the theta parameter. Required to apply Rotary Position Embedding (RoPE) to a tensor. See ‘RoFormer: Enhanced Transformer with Rotary Position Embedding’ (arxiv.org/pdf/2104.09864). **Returns:** The frequency tensor for complex exponentials with shape (max\_seq\_len \* 2, head\_dim / 2, 2) **Return type:** [*TensorValue*](../graph/TensorValue.md#max.graph.TensorValue) ### `head_dim` {#max.nn.rotary_embedding.RotaryEmbedding.head_dim} > head\_dim\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\* *= None* head\_dim = dim // n\_heads if not specified in the config. ### `interleaved` {#max.nn.rotary_embedding.RotaryEmbedding.interleaved} > interleaved\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* *= True* ### `max_seq_len` {#max.nn.rotary_embedding.RotaryEmbedding.max_seq_len} > max\_seq\_len\*: [int](https://docs.python.org/3/library/functions.html#int)\* The maximum sequence length for the model’s input. ### `n_heads` {#max.nn.rotary_embedding.RotaryEmbedding.n_heads} > n\_heads\*: [int](https://docs.python.org/3/library/functions.html#int)\* ### `theta` {#max.nn.rotary_embedding.RotaryEmbedding.theta} > theta\*: [float](https://docs.python.org/3/library/functions.html#float)\* Hyperparameter used to control the frequency scaling of the sinusoidal components of the embeddings.
--- ## rotate_bits_left `rotate_bits_left[shift: Int](x: Int) -> Int` Shifts the bits of an input to the left by `shift` bits (with wrap-around). **Constraints:** `-size <= shift < size` **Parameters:** * ​shift (`Int`): The number of bit positions by which to rotate the bits of the integer to the left (with wrap-around). **Args:** * ​x (`Int`): The input value. **Returns:** The input rotated to the left by `shift` elements (with wrap-around). `rotate_bits_left[dtype: DType, width: Int, //, shift: Int](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Shifts bits to the left by `shift` positions (with wrap-around) for each element of a SIMD vector. **Constraints:** `0 <= shift < size` **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. Must be integral and unsigned. * ​width (`Int`): The width of the SIMD vector. * ​shift (`Int`): The number of positions to rotate left. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector input. **Returns:** SIMD vector with each element rotated left by `shift` bits. --- ## rotate_bits_right `rotate_bits_right[shift: Int](x: Int) -> Int` Shifts the bits of an input to the right by `shift` bits (with wrap-around). **Constraints:** `-size <= shift < size` **Parameters:** * ​shift (`Int`): The number of bit positions by which to rotate the bits of the integer to the right (with wrap-around). **Args:** * ​x (`Int`): The input value. **Returns:** The input rotated to the right by `shift` elements (with wrap-around). `rotate_bits_right[dtype: DType, width: Int, //, shift: Int](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Shifts bits to the right by `shift` positions (with wrap-around) for each element of a SIMD vector. **Constraints:** `0 <= shift < size` **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. Must be integral and unsigned. * ​width (`Int`): The width of the SIMD vector. * ​shift (`Int`): The number of positions to rotate right. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector input. **Returns:** SIMD vector with each element rotated right by `shift` bits. --- ## round `round[T: Roundable, //](number: T) -> T` Get the rounded value of the given object. **Parameters:** * ​T (`Roundable`): The type conforming to Roundable. **Args:** * ​number (`T`): The object to get the rounded value of. **Returns:** The rounded value of the object. `round[T: Roundable, //](number: T, ndigits: Int) -> T` Get the value of this object, rounded to a specified number of digits after the decimal point. **Parameters:** * ​T (`Roundable`): The type conforming to Roundable. **Args:** * ​number (`T`): The object to get the rounded value of. * ​ndigits (`Int`): The number of digits to round to. **Returns:** The rounded value of the object. --- ## Roundable The `Roundable` trait describes a type that defines a rounding operation. Types that conform to `Roundable` will work with the builtin `round` function. The round operation always returns the same type as the input. For example: ```mojo @fieldwise_init struct Complex(Roundable): var re: Float64 var im: Float64 fn __round__(self) -> Self: return Self(round(self.re), round(self.im)) fn __round__(self, ndigits: Int) -> Self: return Self(round(self.re, ndigits), round(self.im, ndigits)) ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__round__` `__round__(self: _Self) -> _Self` Get a rounded value for the type. **Returns:** The rounded value. `__round__(self: _Self, ndigits: Int) -> _Self` Get a rounded value for the type. **Args:** * ​ndigits (`Int`): Number of digits after the decimal point. **Returns:** The rounded value.
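A quick usage sketch for the built-in `round` described above, with and without `ndigits` (the output comments are illustrative):

```mojo
fn main():
    print(round(3.7))         # 4.0 -- rounds a Float64 to a whole value
    print(round(3.14159, 2))  # 3.14 -- keeps two digits after the decimal point
```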
--- ## RoundMode `struct RoundMode` ## Fields * ​value (`Int`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `Ceil` `alias Ceil = RoundMode(3)` ### `Floor` `alias Floor = RoundMode(2)` ### `HalfDown` `alias HalfDown = RoundMode(0)` ### `HalfUp` `alias HalfUp = RoundMode(1)` ## Methods ### `__init__` `@implicit` `__init__(out self, value: Int)` ### `__eq__` `__eq__(self, other: Self) -> Bool` --- ## RTLD `struct RTLD` Enumeration of the RTLD flags used during dynamic library loading. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `GLOBAL` `alias GLOBAL = 256 if os_is_linux() else 8` Make symbols available for symbol resolution of subsequently loaded libraries. ### `LAZY` `alias LAZY = 1` Load library lazily (defer function resolution until needed). ### `LOCAL` `alias LOCAL = 4` Make symbols not available for symbol resolution of subsequently loaded libraries. ### `NOW` `alias NOW = 2` Load library immediately (resolve all symbols on load). --- ## run `run[func: fn() raises -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * ​func (`fn() raises -> None`): The function to benchmark. **Args:** * ​max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * ​min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * ​max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * ​max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` containing the benchmark statistics, including the average execution time of `func` in ns. `run[func: fn() -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * ​func (`fn() -> None`): The function to benchmark. **Args:** * ​max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * ​min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * ​max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * ​max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` containing the benchmark statistics, including the average execution time of `func` in ns. `run[: origin.set, //, func: fn() raises capturing -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * ​func (`fn() raises capturing -> None`): The function to benchmark. **Args:** * ​max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * ​min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * ​max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * ​max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` containing the benchmark statistics, including the average execution time of `func` in ns. `run[: origin.set, //, func: fn() capturing -> None](max_iters: Int = 1000000000, min_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](2), max_runtime_secs: SIMD[float64, 1] = __init__[__mlir_type.!pop.int_literal](60), max_batch_size: Int = 0) -> Report` Benchmarks the function passed in as a parameter. Benchmarking continues until `min_runtime_secs` has elapsed and either `max_runtime_secs` OR `max_iters` is achieved. **Parameters:** * ​func (`fn() capturing -> None`): The function to benchmark. **Args:** * ​max\_iters (`Int`): Max number of iterations to run (default `1_000_000_000`). * ​min\_runtime\_secs (`SIMD[float64, 1]`): Lower bound on benchmarking time in secs (default `2`). * ​max\_runtime\_secs (`SIMD[float64, 1]`): Upper bound on benchmarking time in secs (default `60`). * ​max\_batch\_size (`Int`): The maximum number of iterations to perform per time measurement. **Returns:** A `Report` containing the benchmark statistics, including the average execution time of `func` in ns. --- ## run `run(cmd: String) -> String` Runs the specified command and returns the output as a string. This function executes the given command in a subprocess, captures its standard output, and returns it as a string. It automatically handles opening and closing the subprocess. **Args:** * ​cmd (`String`): The command to execute as a string. **Returns:** The standard output of the command as a string, with trailing whitespace removed. **Raises:** This function raises if: * The command cannot be executed. * There is an IO error reading from the subprocess. * The data written by the subprocess is not valid UTF-8. --- ## run_radix_sort_pairs_gpu `run_radix_sort_pairs_gpu[type: DType, out_idx_type: DType, rank: Int, ascending: Bool = False, BLOCK_SIZE: Int = 256, NUM_BITS_PER_PASS: Int = 4](ctx: DeviceContext, mut input_keys: NDBuffer[type, rank, MutableAnyOrigin], mut output_keys: NDBuffer[type, rank, MutableAnyOrigin], mut input_key_ids: NDBuffer[out_idx_type, rank, MutableAnyOrigin], mut output_key_ids: NDBuffer[out_idx_type, rank, MutableAnyOrigin], skip_sort: NDBuffer[bool, rank, origin])` --- ## runtime Implements the runtime package. ## Modules * [​`asyncrt`](/mojo/stdlib/runtime/asyncrt/): This module implements the low level concurrency library. * [​`tracing`](/mojo/stdlib/runtime/tracing/): Provides tracing utilities. --- ## runtime_layout Provides the `RuntimeLayout` type and functions for working with it. You can use `RuntimeLayout` to define a layout where the dimensions are not known at compile time. You can import these APIs from `layout.runtime_layout`. ```mojo from layout.runtime_layout import RuntimeLayout, make_layout ``` ## Structs * [​`RuntimeLayout`](./RuntimeLayout): A runtime-configurable layout that uses `RuntimeTuple` for storage. ## Functions * [​`coalesce`](./coalesce): Coalesce adjacent dimensions in a runtime layout when possible. * [​`make_layout`](./make_layout): Combine two runtime layouts into a single composite layout. --- ## runtime_tuple Provides the `RuntimeTuple` data structure and related utility functions for handling tuple-like data with both compile-time and runtime elements.
`RuntimeTuple` is designed for high-performance tensor operations, supporting efficient manipulation of multi-dimensional data structures like shapes, indices, and coordinates. Key features: * Hybrid compile-time/runtime value handling * Optimized for parallel execution and hardware acceleration * Support for nested tuple structures * Efficient conversion between linear indices and multi-dimensional coordinates * Specialized operations for tensor shape calculations The module includes functions for tuple manipulation (concatenation, flattening), coordinate transformations (`idx2crd`, `crd2idx`), and specialized tensor operations like shape division and prefix products. ## Structs * [​`RuntimeTuple`](./RuntimeTuple): A struct representing tuple-like data with compile-time and runtime elements. RuntimeTuple combines static (compile-time) and dynamic (runtime) handling of tuple-like data structures, typically used for tensor shapes, indices, and coordinates in high-performance computing contexts. This struct is optimized for parallel execution and hardware acceleration, allowing efficient manipulation of multi-dimensional data. It supports both known compile-time values and runtime-determined values. ## Functions * [​`concat`](./concat): Concatenates two `IntTuple` instances into a single `IntTuple`. * [​`crd2idx`](./crd2idx): Converts multi-dimensional coordinates to a linear index. * [​`idx2crd`](./idx2crd): Converts a linear index to multi-dimensional coordinates. This function transforms a flat index into coordinate values based on the provided shape and stride information. This is essential for mapping linear memory accesses to multi-dimensional tensor elements. * [​`is_int`](./is_int): Determines if a `RuntimeTuple` represents a scalar integer value. * [​`is_tuple`](./is_tuple): Determines if a `RuntimeTuple` represents a tuple rather than a scalar value. * [​`prefix_product`](./prefix_product): Computes the prefix products of elements in the `RuntimeTuple`. * [​`product`](./product): Computes the product of all elements in the `RuntimeTuple`. * [​`shape_div`](./shape_div): Performs specialized shape division between `RuntimeTuple`s. * [​`signum`](./signum): Returns the sign of an integer value. --- ## RuntimeLayout `@register_passable(trivial)` `struct RuntimeLayout[layout: Layout, /, *, element_type: DType = int64, linear_idx_type: DType = int64]` A runtime-configurable layout that uses `RuntimeTuple` for storage. This struct provides a layout implementation that can be modified at runtime, unlike the static [`Layout`](/mojo/stdlib/layout/layout/Layout) type. It uses [`RuntimeTuple`](/mojo/stdlib/layout/runtime_tuple/RuntimeTuple) for shape and stride storage. The layout must have statically known dimensions at compile time, but the actual shape and stride values can be modified during execution. ## Parameters * ​layout (`Layout`): The static `Layout` type to base this runtime layout on. * ​element\_type (`DType`): The integer type of each dimension element. Must be signed. * ​linear\_idx\_type (`DType`): The integer type of the linear index into memory returned by `crd2idx`. Must be signed. ## Fields * ​shape (`RuntimeTuple[layout.shape, element_type=element_type]`): The shape of the layout as a runtime tuple. Stores the size of each dimension. Uses the type specified by `element_type`. Must match the static layout's shape dimensions. * ​stride (`RuntimeTuple[layout.stride, element_type=linear_idx_type]`): The stride of the layout as a runtime tuple.
Stores the stride (step size) for each dimension. Uses the type specified by `linear_idx_type` (64-bit by default), since strides can be large values. Must match the static layout's stride dimensions. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Initialize a `RuntimeLayout` with default values. Creates a new `RuntimeLayout` instance with default shape and stride values. Requires that the static layout has known dimensions at compile time. **Constraints:** The static layout that this runtime layout is based on must have all dimensions known. `__init__(shape: RuntimeTuple[layout.shape, element_type=element_type], stride: RuntimeTuple[layout.stride, element_type=linear_idx_type]) -> Self` Initialize a `RuntimeLayout` with specified shape and stride. **Args:** * ​shape (`RuntimeTuple[layout.shape, element_type=element_type]`): A `RuntimeTuple` containing the dimensions of each axis. * ​stride (`RuntimeTuple[layout.stride, element_type=linear_idx_type]`): A `RuntimeTuple` containing the stride values for each axis. ### `__call__` `__call__(self, idx: Int) -> SIMD[linear_idx_type, 1]` Convert a single index to a flat linear index. **Args:** * ​idx (`Int`): The one-dimensional index to convert. **Returns:** The corresponding flat linear index in the layout. `__call__[: ImmutableOrigin, //, t: IntTuple[$0]](self, idx: RuntimeTuple[t, element_type=element_type]) -> SIMD[linear_idx_type, 1]` Convert a multi-dimensional index to a flat linear index. **Parameters:** * ​t (`IntTuple[$0]`): The `IntTuple` type for the index. **Args:** * ​idx (`RuntimeTuple[t, element_type=element_type]`): A `RuntimeTuple` containing the multi-dimensional coordinates. **Returns:** The corresponding flat linear index in the layout. ### `size` `size(self) -> Int` Calculate the total number of elements in the layout. **Returns:** The product of all dimensions in the shape, representing the total number of elements that can be addressed by this layout. ### `bound_check_required` `bound_check_required(self) -> Bool` Determine if bounds checking is required for this layout. **Returns:** True if any dimension in the shape differs from the static layout's shape, False otherwise. ### `cast` `cast[element_type: DType, /, *, linear_idx_type: DType = linear_idx_type](self) -> RuntimeLayout[layout, element_type=element_type, linear_idx_type=linear_idx_type]` Cast the layout to use a different element bitwidth. **Parameters:** * ​element\_type (`DType`): The target data type. * ​linear\_idx\_type (`DType`): The target linear idx type. **Returns:** A new `RuntimeLayout` with the shape cast to the specified type. ### `__str__` `__str__(self) -> String` Convert the layout to a string representation. **Returns:** A string representation of the layout. ### `row_major` `static row_major[rank: Int, //](shape: IndexList[rank, element_type=element_type]) -> Self` Create a row-major layout from the given shape. In row-major layout, elements with adjacent rightmost indices are adjacent in memory. **Parameters:** * ​rank (`Int`): The number of dimensions in the layout. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): An `IndexList` containing the dimensions of each axis. **Returns:** A `RuntimeLayout` with row-major stride ordering. ### `col_major` `static col_major[rank: Int, //](shape: IndexList[rank, element_type=element_type]) -> Self` Create a column-major layout from the given shape.
In column-major layout, elements with adjacent leftmost indices are adjacent in memory. **Parameters:** * ​rank (`Int`): The number of dimensions in the layout. **Args:** * ​shape (`IndexList[rank, element_type=element_type]`): An `IndexList` containing the dimensions of each axis. **Returns:** A `RuntimeLayout` with column-major stride ordering. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write a string representation of the layout to a writer. **Parameters:** * ​W (`Writer`): The `Writer` type. **Args:** * ​writer (`W`): The `Writer` object to write the layout representation to. ### `sublayout` `sublayout[i: Int](self) -> RuntimeLayout[layout[i], element_type=element_type, linear_idx_type=linear_idx_type]` Extract a nested sublayout at the specified index. **Parameters:** * ​i (`Int`): The index of the nested layout to extract. **Returns:** A `RuntimeLayout` representing the nested layout at index i. ### `dim` `dim(self, i: Int) -> Int` Get the size of the dimension at the specified index. **Args:** * ​i (`Int`): The index of the dimension to retrieve. **Returns:** The size of the dimension at index `i`. ### `__len__` `static __len__() -> Int` Get the number of dimensions in the layout. **Returns:** The number of dimensions (rank) of the layout. --- ## RuntimeTensorSpec `@register_passable(trivial)` `struct RuntimeTensorSpec[type: DType, rank: Int]` ## Fields * ​shape (`IndexList[rank]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__getitem__` `__getitem__(self, idx: Int) -> Int` ### `bytecount` `bytecount(self) -> Int` Gets the total byte count. **Returns:** The total byte count. --- ## RuntimeTuple `@register_passable(trivial)` `struct RuntimeTuple[origin: ImmutableOrigin, //, S: IntTuple[origin] = IntTuple(-1), /, *, element_type: DType = int64]` A struct representing tuple-like data with compile-time and runtime elements. RuntimeTuple combines static (compile-time) and dynamic (runtime) handling of tuple-like data structures, typically used for tensor shapes, indices, and coordinates in high-performance computing contexts. This struct is optimized for parallel execution and hardware acceleration, allowing efficient manipulation of multi-dimensional data. It supports both known compile-time values and runtime-determined values. ## Parameters * ​origin (`ImmutableOrigin`): The origin corresponding to the `IntTuple`. * ​S (`IntTuple[origin]`): `IntTuple` with compile-time known values (or `UNKNOWN_VALUE` for runtime values). * ​element\_type (`DType`): Integer type of the underlying elements. ## Fields * ​value (`IndexList[len[::Sized](flatten[::Origin[::Bool(S)), element_type=element_type]`): Storage for the actual tuple values, implemented as an IndexList with the appropriate size and element type. ## Implemented traits `AnyType`, `Copyable`, `Intable`, `Movable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `scalar_length` `alias scalar_length = len[::Sized](flatten[::Origin[::Bool(S))` The total number of scalar elements in this RuntimeTuple after flattening nested tuples. ## Methods ### `__init__` `__init__() -> Self` Initialize a `RuntimeTuple` with default values. For dimensions with known compile-time values in S, uses those values. For unknown dimensions, initializes them to UNKNOWN\_VALUE. `@implicit` `__init__(*values: Int) -> Self` Initialize a `RuntimeTuple` with the provided values. 
**Args:** * ​\*values (`Int`): Variadic number of integer values to initialize the tuple with. `@implicit` `__init__[l: Int](values: IndexList[l, element_type=element_type]) -> Self` Initialize a `RuntimeTuple` from an `IndexList`. **Parameters:** * ​l (`Int`): Compile-time length of the input `IndexList`. **Args:** * ​values (`IndexList[l, element_type=element_type]`): `IndexList` to initialize from. Must have same length as the `RuntimeTuple`. The values will be cast to the appropriate element type if needed. ### `__getitem__` `__getitem__[i: Int](self) -> RuntimeTuple[S[i], element_type=element_type]` Retrieves the element at the specified index in the tuple. This method provides array-like indexing for RuntimeTuple, allowing access to individual elements or sub-tuples. It handles the internal offset calculation to access the correct elements in the flattened storage array. **Parameters:** * ​i (`Int`): The index of the element to retrieve. **Returns:** A new `RuntimeTuple` containing the element or sub-tuple at the specified index. ### `__setitem__` `__setitem__[i: Int](mut self, val: SIMD[element_type, 1])` Sets the value of the element at the specified index in the tuple. This method enables array-like assignment for RuntimeTuple elements, handling the internal offset calculation to modify the correct element in the flattened storage array. **Parameters:** * ​i (`Int`): The index of the element to modify. **Args:** * ​val (`SIMD[element_type, 1]`): The new value to assign to the element. ### `offset_until` `static offset_until[i: Int]() -> Int` Calculates the offset in the flattened value array for a given tuple index. This method computes the sum of lengths of all flattened subtuple elements that come before the specified index, which is used for indexing into the internal storage. **Parameters:** * ​i (`Int`): The tuple index to calculate the offset for. **Returns:** The offset in the flattened array where the i-th element begins. ### `get_int` `get_int(self) -> SIMD[element_type, 1]` Returns the integer value of this RuntimeTuple. For tuples with a known compile-time value, returns that value. For tuples with a runtime value, returns the first element of the internal storage array. **Returns:** The integer value of this RuntimeTuple. ### `__str__` `__str__(self) -> String` Converts the RuntimeTuple to its string representation. This method provides a human-readable string representation of the tuple, which is useful for debugging and logging. **Returns:** A string representation of the `RuntimeTuple`. ### `concat` `concat[: ImmutableOrigin, //, R: IntTuple[$0]](self, rhs: RuntimeTuple[R, element_type=element_type]) -> RuntimeTuple[concat[::Origin[::Bool(S, R), element_type=element_type]` Concatenates two `RuntimeTuple`s together. This method combines the current `RuntimeTuple` with another one, preserving both compile-time and runtime values. It handles the complexity of merging the underlying storage arrays while maintaining the proper semantic structure. **Parameters:** * ​R (`IntTuple[$0]`): The `IntTuple` type parameter of the right-hand side RuntimeTuple. **Args:** * ​rhs (`RuntimeTuple[R, element_type=element_type]`): The `RuntimeTuple` to concatenate to the end of this one. **Returns:** A new `RuntimeTuple` containing all elements from both tuples in sequence. ### `flatten` `flatten(self) -> RuntimeTuple[flatten[::Origin[::Bool(S), element_type=element_type]` Flattens a potentially nested `RuntimeTuple` into a single-level tuple. 
This method converts a hierarchical structure of tuples into a flat representation, preserving all values but removing the nested structure. This is useful for operations that need to treat all elements uniformly. **Returns:** A new `RuntimeTuple` containing all elements in a flat (non-nested) structure. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Writes the RuntimeTuple to a Writer object. This method is used by the string conversion system to generate a string representation of the RuntimeTuple. It handles both scalar values and nested tuple structures, producing a properly formatted output. **Parameters:** * ​W (`Writer`): The Writer type to use for output. **Args:** * ​writer (`W`): The Writer object to write the string representation to. ### `__len__` `__len__(self) -> Int` Returns the length (number of top-level elements) of the `RuntimeTuple`. This method provides the standard Python-like len() functionality, giving the number of elements at the top level of the tuple structure. For nested tuples, this returns the number of first-level entries, not the total number of scalar values. **Returns:** The number of top-level elements in the tuple. ### `cast` `cast[type: DType](self) -> RuntimeTuple[S, element_type=type]` Casts the RuntimeTuple to use a different numeric type. This method creates a new RuntimeTuple with the same structure and values but using a different underlying numeric type for storage. This is useful for changing precision or signedness of the data. **Parameters:** * ​type (`DType`): The target DType to cast the elements to. **Returns:** A new `RuntimeTuple` with elements cast to the specified type. ### `__int__` `__int__(self) -> Int` Converts the RuntimeTuple to an integer value. This method enables implicit conversion of a RuntimeTuple to an integer, but is constrained to only work on scalar tuples (those that contain a single value). **Returns:** The integer value of the tuple. --- ## S_ISBLK `S_ISBLK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a block device. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a block device and False otherwise. --- ## S_ISCHR `S_ISCHR[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a character device. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a character device and False otherwise. --- ## S_ISDIR `S_ISDIR[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a directory. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a directory and False otherwise. --- ## S_ISFIFO `S_ISFIFO[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a fifo. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a fifo and False otherwise. --- ## S_ISLNK `S_ISLNK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a symlink. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a symlink and False otherwise. --- ## S_ISREG `S_ISREG[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a regular file. 
**Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a regular file and False otherwise. --- ## S_ISSOCK `S_ISSOCK[intable: Intable](mode: intable) -> Bool` Returns True if the mode is a socket. **Parameters:** * ​intable (`Intable`): A type conforming to Intable. **Args:** * ​mode (`intable`): The file mode. **Returns:** True if the mode is a socket and False otherwise. --- ## sampling ## `rejection_sampler()` {#max.pipelines.lib.sampling.rejection_sampler} > max.pipelines.lib.sampling.rejection\_sampler(top\_k, device) **Parameters:** * **top\_k** ([`int`](https://docs.python.org/3/library/functions.html#int) ) * **device** (`DeviceRef` ) **Return type:** [*Graph*](../graph/Graph.md#max.graph.Graph) ## `token_sampler()` {#max.pipelines.lib.sampling.token_sampler} > max.pipelines.lib.sampling.token\_sampler(sampling\_config, device, return\_logits=False) **Parameters:** * **sampling\_config** (`SamplingConfig` ) * **device** (`DeviceRef` ) * **return\_logits** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** [*Graph*](../graph/Graph.md#max.graph.Graph) --- ## sampling ## Functions * [​`apply_penalties_to_logits`](./apply_penalties_to_logits): Apply penalties to the logits based on the frequency of the tokens in the batch. * [​`update_frequency_data`](./update_frequency_data): Update the frequency data for the given new tokens. --- ## scalb `scalb[dtype: DType, width: Int, //](arg0: SIMD[dtype, width], arg1: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `scalb` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​arg0 (`SIMD[dtype, width]`): The first input argument. * ​arg1 (`SIMD[dtype, width]`): The second input argument. **Returns:** The `scalb` of the inputs. --- ## scale_and_mask_helper `scale_and_mask_helper[p_type: DType, p_layout: Layout, mask_t: MHAMask, score_mod_t: ScoreModTrait, group: Int, num_n_mmas: Int, WN: Int, MMA_N: Int, simd_width: Int, use_score_mod: Bool = False](p_reg_tile: LayoutTensor[p_type, p_layout, origin, address_space=AddressSpace(5)], scale: SIMD[float32, 1], num_keys: UInt, bound: UInt, lane: UInt, warp: UInt, mask: mask_t, score_mod: score_mod_t, kv_tile_start_row: Int, mask_stride: UInt, max_seq_len: Int)` --- ## scale_min_k4 `scale_min_k4(src_ptr: UnsafePointer[block_Q4_K], g: Int) -> Tuple[SIMD[float32, 1], SIMD[float32, 1]]` --- ## scatter `scatter[dtype: DType, size: Int, //](value: SIMD[dtype, size], owned base: SIMD[index, size], mask: SIMD[bool, size], alignment: Int = 0)` Takes scalar values from a SIMD vector and `scatters` them into a vector of pointers. The scatter operation stores the scalar values from a SIMD vector to the memory locations addressed by a vector of pointers. The memory locations are provided in the vector of pointers `base` as addresses. The values are stored according to the provided mask. The mask holds a bit for each vector lane, and is used to prevent memory accesses to the masked-off lanes. The `value` operand is a vector value to be written to memory. The `base` operand is a vector of pointers, pointing to where the value elements should be stored. It has the same underlying type as the value operand. The `mask` operand is a vector of boolean values.
The types of the `mask` and the `value` operand must have the same number of vector elements. Scatter with overlapping addresses is guaranteed to be ordered from least-significant to most-significant element. In general, for some vector `value`, vector of pointers `base`, and mask `mask` a call of the form: ```mojo scatter(value, base, mask) ``` is equivalent to the following sequence of scalar stores in C++: ```cpp for (int i = 0; i < size; i++) { if (mask[i]) *base[i] = value[i]; } ``` **Parameters:** * ​dtype (`DType`): DType of `value`, the SIMD vector of values to store. * ​size (`Int`): Size of `value`, the SIMD vector of values to store. **Args:** * ​value (`SIMD[dtype, size]`): The vector containing the values to be scattered to memory. * ​base (`SIMD[index, size]`): The vector containing memory addresses that scatter will access. * ​mask (`SIMD[bool, size]`): A binary vector which prevents memory access to certain lanes of the base vector. * ​alignment (`Int`): The alignment of the source addresses. Must be 0 or a power of two constant integer value. --- ## scatter_elements `scatter_elements[reduce_fn: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1], rank: Int, input_type: DType, indices_type: DType](input: ManagedTensorSlice[io_spec, static_spec=static_spec], indices: ManagedTensorSlice[io_spec, static_spec=static_spec], updates: ManagedTensorSlice[io_spec, static_spec=static_spec], _axis: Int, output: ManagedTensorSlice[io_spec, static_spec=static_spec])` Implements the ONNX ScatterElements op, which is equivalent to PyTorch scatter. --- ## scatter_elements_shape `scatter_elements_shape[rank: Int, input_type: DType, indices_type: DType, //, *, single_thread_blocking_override: Bool](input: NDBuffer[input_type, rank, origin], updates: NDBuffer[input_type, rank, origin], indices: NDBuffer[indices_type, rank, origin], axis: Int) -> IndexList[rank]` Compute the output shape of a `scatter_elements` operation, and assert the inputs are compatible. **Parameters:** * ​rank (`Int`): Rank of the input tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[input_type, rank, origin]`): The input tensor. * ​updates (`NDBuffer[input_type, rank, origin]`): The updates tensor. * ​indices (`NDBuffer[indices_type, rank, origin]`): The indices tensor. * ​axis (`Int`): The axis. **Returns:** The output shape. --- ## scatter_nd `scatter_nd[output_type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, updates_rank: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](data: NDBuffer[output_type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], updates: NDBuffer[output_type, updates_rank, origin], output: NDBuffer[output_type, data_rank, origin], context: DeviceContextPtr = DeviceContextPtr())` Scatter\_nd operation without any reduction.
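To make the ScatterND semantics concrete, here is a small, hedged Mojo sketch of the no-reduction case where each index row is a full coordinate into `data` (that is, `indices_shape[-1] == data_rank`). It uses plain `List` values in place of `NDBuffer` and made-up shapes, so it illustrates the update rule rather than the kernel's actual API.

```mojo
fn main():
    # `data` is a flattened 2x3 tensor; ScatterND writes `updates[k]` at
    # the coordinate given by the k-th index row.
    var data = List[Int](0, 0, 0, 0, 0, 0)
    alias cols = 3
    # Two index rows, each a full (row, col) coordinate: (0, 1) and (1, 2).
    var indices = List[Int](0, 1, 1, 2)
    var updates = List[Int](10, 20)

    for k in range(2):
        var row = indices[2 * k]
        var col = indices[2 * k + 1]
        data[row * cols + col] = updates[k]

    for i in range(len(data)):
        print(data[i])  # prints 0 10 0 0 0 20, one value per line
```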
--- ## scatter_nd_generator `scatter_nd_generator[output_type: DType, indices_type: DType, data_rank: Int, indices_rank: Int, updates_rank: Int, single_thread_blocking_override: Bool, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu"), /, reduce_fn: OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), *, _trace_description: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("scatter_nd")](data: NDBuffer[output_type, data_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin], updates: NDBuffer[output_type, updates_rank, origin], output: NDBuffer[output_type, data_rank, origin], context: DeviceContextPtr = DeviceContextPtr())` Implements the ONNX ScatterND operation as defined in the ONNX operator specification. **Parameters:** * ​output\_type (`DType`): Type of data, updates, and output tensors. * ​indices\_type (`DType`): Type of the indices tensor. * ​data\_rank (`Int`): Rank of the input (data) tensor (data\_rank >= 1). * ​indices\_rank (`Int`): Rank of the indices tensor (indices\_rank >= 1). * ​updates\_rank (`Int`): Rank of the updates tensor (updates\_rank = data\_rank + indices\_rank - indices\_shape\[-1] - 1). * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): Target device: cpu or cuda. * ​reduce\_fn (`OptionalReg[fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]`): Reduction function to apply: none (default), add, mul, max, min. * ​\_trace\_description (`StringSlice[StaticConstantOrigin]`): A description of the function, used for profiling and tracing. **Args:** * ​data (`NDBuffer[output_type, data_rank, origin]`): Tensor of rank data\_rank >= 1. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): Tensor of rank indices\_rank containing indices for the scatter operation. * ​updates (`NDBuffer[output_type, updates_rank, origin]`): Tensor containing values to update output tensor based on indices tensor. * ​output (`NDBuffer[output_type, data_rank, origin]`): Tensor of rank data\_rank, shaped the same as data tensor. * ​context (`DeviceContextPtr`): Pointer to DeviceContext. --- ## scatter_nd_shape `scatter_nd_shape[input_rank: Int, updates_rank: Int, indices_rank: Int, input_type: DType, indices_type: DType, single_thread_blocking_override: Bool](input: NDBuffer[input_type, input_rank, origin], updates: NDBuffer[input_type, updates_rank, origin], indices: NDBuffer[indices_type, indices_rank, origin]) -> IndexList[input_rank]` Compute the output shape of a `scatter_nd` operation, and assert the inputs are compatible. **Parameters:** * ​input\_rank (`Int`): Rank of the input tensor. * ​updates\_rank (`Int`): Rank of the updates tensor. * ​indices\_rank (`Int`): Rank of the indices tensor. * ​input\_type (`DType`): Type of the input tensor. * ​indices\_type (`DType`): Type of the indices tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input (`NDBuffer[input_type, input_rank, origin]`): The input tensor. * ​updates (`NDBuffer[input_type, updates_rank, origin]`): The updates tensor. * ​indices (`NDBuffer[indices_type, indices_rank, origin]`): The indices tensor. **Returns:** The output shape.
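The `updates_rank` relation quoted above is easy to sanity-check by hand. A tiny, hedged arithmetic sketch with hypothetical shapes (data `(4, 4, 4)`, indices `(2, 1)`):

```mojo
fn main():
    # updates_rank = data_rank + indices_rank - indices_shape[-1] - 1
    var data_rank = 3       # data shape (4, 4, 4)
    var indices_rank = 2    # indices shape (2, 1)
    var last_index_dim = 1  # indices_shape[-1]: each row indexes one axis
    print(data_rank + indices_rank - last_index_dim - 1)  # 3
```

With these shapes, each index row selects a rank-2 slice of `data`, so `updates` would have shape `(2, 4, 4)`, which indeed has rank 3.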
--- ## schedule_barrier `schedule_barrier(mask: AMDScheduleBarrierMask = AMDScheduleBarrierMask(0))` Controls instruction scheduling across a barrier point in AMD GPU code. This function creates a scheduling barrier that controls which types of instructions can be reordered across it by the compiler. The mask parameter specifies which instruction categories (ALU, memory, etc.) are allowed to cross the barrier during scheduling optimization. Note: This function only has an effect on AMD GPUs. On other platforms it will raise a compile-time error. **Args:** * ​mask (`AMDScheduleBarrierMask`): A bit mask of AMDScheduleBarrierMask flags indicating which instruction types can be scheduled across this barrier. Default is NONE, meaning no instructions can cross. --- ## schedule_group_barrier `schedule_group_barrier(mask: AMDScheduleBarrierMask, size: SIMD[int32, 1], sync_id: SIMD[int32, 1])` Controls instruction scheduling across a barrier point in AMD GPU code by creating schedule groups. This function creates a scheduling barrier that groups instructions into sequences with custom ordering. It affects the code that precedes the barrier. The barrier ensures instructions are scheduled according to the specified group parameters. Note: This function only has an effect on AMD GPUs. On other platforms it will raise a compile-time error. The sync\_id parameter allows creating multiple schedule groups that can be ordered relative to each other. **Args:** * ​mask (`AMDScheduleBarrierMask`): A bit mask of AMDScheduleBarrierMask flags indicating which instruction types can be scheduled across this barrier. Similar to schedule\_barrier masks. * ​size (`SIMD[int32, 1]`): The number of times to repeat the instruction sequence in the schedule group. * ​sync\_id (`SIMD[int32, 1]`): A unique identifier for the group that determines the ordering of instructions within the same schedule group. --- ## Scope `struct Scope` Represents memory synchronization scope levels for GPU memory operations. Defines different scopes of memory visibility and synchronization, from thread-local to system-wide. Each scope level determines how memory operations are ordered and visible across different execution units. The scope levels form a hierarchy, with each higher level providing stronger ordering guarantees but potentially higher synchronization costs. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `Movable`, `UnknownDestructibility`, `Writable` ## Aliases ### `BLOCK` `alias BLOCK = Scope(3)` Block-level scope. Memory operations ordered within a thread block/CTA. ### `CLUSTER` `alias CLUSTER = Scope(4)` Cluster-level scope. Memory operations ordered within a thread block cluster. ### `GPU` `alias GPU = Scope(5)` GPU-level scope. Memory operations are ordered across all threads on the GPU. ### `NONE` `alias NONE = Scope(0)` No memory ordering guarantees. Operations may be reordered freely. ### `SYSTEM` `alias SYSTEM = Scope(6)` System-wide scope. Memory operations ordered across the entire system. ### `THREAD` `alias THREAD = Scope(1)` Thread-level scope. Memory operations are ordered within a single thread. ### `WARP` `alias WARP = Scope(2)` Warp-level scope. Memory operations are ordered within a warp of threads. ## Methods ### `__eq__` `__eq__(self, other: Self) -> Bool` Checks if two `Scope` instances are equal. Uses pointer comparison for efficiency. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the instances are the same, False otherwise.
### `__ne__` `__ne__(self, other: Self) -> Bool` Checks if two `Scope` instances are not equal. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the instances are different, False otherwise. ### `__is__` `__is__(self, other: Self) -> Bool` Checks if two `Scope` instances have the same value. Compares the underlying integer values. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the values are the same, False otherwise. ### `__isnot__` `__isnot__(self, other: Self) -> Bool` Checks if two `Scope` instances have different values. **Args:** * ​other (`Self`): The other `Scope` instance to compare with. **Returns:** True if the values are different, False otherwise. ### `write_to` `write_to[W: Writer](self, mut w: W)` Writes the string representation of the scope to a writer. **Parameters:** * ​W (`Writer`): The type of writer to use for output. Must implement the Writer interface. **Args:** * ​w (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Returns the string representation of the memory scope. **Returns:** A string representation of the memory scope. ### `__repr__` `__repr__(self) -> String` Returns the string representation of the memory scope. **Returns:** A string representation of the memory scope. ### `mnemonic` `mnemonic(self) -> StringSlice[StaticConstantOrigin]` Returns the mnemonic string representation of the memory scope. Converts the memory scope level into a string mnemonic used by LLVM/NVVM intrinsics for memory operations. **Returns:** A string literal containing the mnemonic. --- ## ScoreModTrait The `ScoreModTrait` trait describes a `score_mod` functor for MHA kernels, such as an ALiBi bias. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `name_str` `alias name_str` ## Methods ### `score_mod` `score_mod[type: DType, width: Int, //, *, element_type: DType = int32](self: _Self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width], max_prompt_len: Int = 0) -> SIMD[type, width]` Returns the score vector at the given coordinates after applying the score\_mod functor. Arguments: `coord` is (seq\_id, head, q\_idx, k\_idx); `score_vec` is the slice of the score matrix at `coord`. The score\_mod functor computes a modification tensor and adds it to `score_vec`. --- ## seed `seed()` Seeds the random number generator using the current time. `seed(a: Int)` Seeds the random number generator using the value provided. **Args:** * ​a (`Int`): The seed value. --- ## select_config `select_config[a_type: DType, b_type: DType, c_type: DType, transpose_b: Bool = False](M: Int, N: Int, K: Int, ctx: DeviceContext) -> MatmulConfig[a_type, b_type, c_type, transpose_b]` --- ## select_inner_kernel `select_inner_kernel[a_type: DType, b_type: DType, c_type: DType]() -> InnerKernelID` --- ## select_k_atom `select_k_atom[type: DType, swizzle_mode: TensorMapSwizzle]() -> Layout` Creates a core matrix layout for tensor core operations. Constructs the fundamental atomic layout for tensor core operations based on the specified data type and swizzle mode. This layout represents the minimal dense matrix structure that can be efficiently processed by tensor cores. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode. **Returns:** `Layout` - A core matrix layout optimized for tensor core operations.
--- ## Self-attention Self-attention is a mechanism in a [transformer](transformer.mdx) model that calculates the importance of different tokens (such as words) in a sequence, relative to each other. Each token is said to "attend to" all other tokens in the sequence by assigning an "attention score" to each one. In a large language model (LLM), self-attention allows the model to build an understanding of the whole text by evaluating how each word is relevant to all other words in the text, no matter how far they are from each other. The attention scores are computed using query, key, and value (QKV) vectors that pertain to each token: - The **query** is a vector that expresses what information a token is *looking for* among all the other tokens (like a search query). - The **key** is a vector that describes the information a token *offers* to other tokens (like an answer to a token's query). - The **value** is a vector that provides the **contextually-relevant information** about this token. After calculating attention scores by comparing the **query** and **key** vectors between tokens, self-attention uses the scores to apply weighted information from each token's **value** into a new [embedding](embedding.mdx) for each token. Thus, self-attention outputs a new token embedding for each token that carries information about its relationship with the other tokens in the sequence. The model also saves the calculated keys and values into the [KV cache](kv-cache.mdx) to avoid redundant recompute for the same tokens during the next [autoregression](autoregression.mdx) cycle. --- ## semaphore This module provides a device-wide semaphore implementation for NVIDIA GPUs. The Semaphore struct enables inter-CTA (Cooperative Thread Array) synchronization by providing atomic operations and memory barriers. It uses NVIDIA-specific intrinsics to implement efficient thread synchronization. Example: ```mojo from gpu import Semaphore var lock = UnsafePointer[Int32](...) var sem = Semaphore(lock, thread_id) # Wait for a specific state sem.wait(0) # Release the semaphore sem.release(1) ``` ## Structs * [​`Semaphore`](/mojo/stdlib/gpu/semaphore/Semaphore): A device-wide semaphore implementation for GPUs. --- ## Semaphore `@register_passable` `struct Semaphore` A device-wide semaphore implementation for GPUs. This struct provides atomic operations and memory barriers for inter-CTA synchronization. It uses a single thread per CTA to perform atomic operations on a shared lock variable. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(lock: UnsafePointer[SIMD[int32, 1]], thread_id: Int) -> Self` Initialize a new Semaphore instance. **Args:** * ​lock (`UnsafePointer[SIMD[int32, 1]]`): Pointer to shared lock variable in global memory. * ​thread\_id (`Int`): Thread ID within the CTA, used to determine if this thread should perform atomic operations. ### `fetch` `fetch(mut self)` Fetch the current state of the semaphore from global memory. Only the designated wait thread (thread 0) performs the actual load, using an acquire memory ordering to ensure proper synchronization. ### `state` `state(self) -> SIMD[int32, 1]` Get the current state of the semaphore. **Returns:** The current state value of the semaphore. ### `wait` `wait(mut self, status: Int = 0)` Wait until the semaphore reaches the specified state. Uses a barrier-based spin loop where all threads participate in checking the state. Only proceeds when the state matches the expected status.
**Args:** * ​status (`Int`): The state value to wait for (defaults to 0). ### `release` `release(mut self, status: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Release the semaphore by setting it to the specified state. Ensures all threads have reached this point via a barrier before the designated thread updates the semaphore state. **Args:** * ​status (`SIMD[int32, 1]`): The new state value to set (defaults to 0). --- ## sendmsg `sendmsg(opcode: SIMD[int32, 1], msg: SIMD[int32, 1])` Send a message to fixed function hardware. Refer to the specific ISA manual for the ops and messages. **Args:** * ​opcode (`SIMD[int32, 1]`): The operation to perform. * ​msg (`SIMD[int32, 1]`): The message to send. --- ## SeqInfo `@register_passable(trivial)` `struct SeqInfo` ## Fields * ​seq\_len (`SIMD[uint32, 1]`): * ​start\_of\_seq (`SIMD[uint32, 1]`): * ​prompt\_offset (`SIMD[uint32, 1]`): * ​head\_idx (`SIMD[uint32, 1]`): * ​prompt\_idx (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(seq_len: SIMD[uint32, 1], start_of_seq: SIMD[uint32, 1], work: WorkInfo) -> Self` ### `is_valid` `is_valid(self) -> Bool` ### `create` `static create[ragged: Bool](work: WorkInfo, valid_length: NDBuffer[uint32, 1, MutableAnyOrigin], max_seq_len: SIMD[uint32, 1]) -> Self` --- ## sequential A General sequential layer, each layer is executed with the outputs of the previous. ## `Sequential` {#max.nn.sequential.Sequential} > *class* max.nn.sequential.Sequential(layers) A sequential stack of layers where each layer is called by the outputs of the previous layer. **Parameters:** **layers** ([`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`Layer`](layer.md#max.nn.layer.Layer) `]` ) --- ## Serverless GPU inference on Google Cloud Run import SmallCards from '@site/src/components/SmallCards'; Google Cloud Run is a fully managed compute platform that lets you run any container, making it a great option for deploying an AI endpoint with MAX. This tutorial guides you through the process of deploying Llama 3 with [MAX container](https://docs.modular.com/max/container/) on [Google Cloud Run](https://cloud.google.com/run), so you get automatic scaling and serverless deployment without managing any of the infrastructure yourself. ## Requirements Before starting this tutorial, ensure that you have: - A Google Cloud account with [billing enabled](https://cloud.google.com/billing/docs/how-to/verify-billing-enabled#confirm_billing_is_enabled_on_a_project) - The `gcloud` CLI tool [installed](https://cloud.google.com/sdk/docs/install) and [initialized](https://cloud.google.com/sdk/docs/initializing) Also make sure your Google Cloud project has access to the necessary [quotas and system limits](https://cloud.google.com/docs/quotas/understand-limits). For more information on compatible GPUs, see [GCP's supported GPU types](https://cloud.google.com/run/docs/configuring/services/gpu#gpu-type). We recommend the following hardware resources: - **GPU**: NVIDIA L4 (or another [compatible GPU](/max/faq#gpu-requirements)) - **CPU**: 8 vCPUs - **Memory**: At least 32 GiB ## Deploy MAX to Cloud Run This section guides you through deploying the MAX container for Llama 3.1 inference on Google Cloud Run with GPU acceleration. 1. 
Before deploying, set up the required environment variables, including your Google Cloud [project ID](https://cloud.google.com/storage/docs/projects) and a [supported region](https://cloud.google.com/run/docs/configuring/services/gpu#supported-regions) for Cloud Run with GPUs. :::note Because Cloud Run with GPUs is in public preview, you should use a separate project for your GPU services, and not the same project that contains your other production workloads. ::: ```bash export PROJECT_ID="your-project-id" export REGION="us-central1" ``` 2. To use Cloud Run and Cloud Build, you must enable the necessary APIs: ```bash gcloud services enable \ run.googleapis.com \ cloudbuild.googleapis.com ``` 3. Now, deploy the MAX container to Cloud Run using the following command: ```bash gcloud beta run deploy max-nvidia-full \ --image=modular/max-nvidia-full \ --region=${REGION} \ --platform=managed \ --memory=32Gi \ --cpu=8 \ --timeout=1200 \ --port=8000 \ --min-instances=1 \ --max-instances=5 \ --concurrency=5 \ --cpu-boost \ --args="--model-path=modularai/Llama-3.1-8B-Instruct-GGUF" \ --allow-unauthenticated \ --gpu=1 \ --gpu-type=nvidia-l4 \ --set-env-vars=HF_HUB_ENABLE_HF_TRANSFER=1 \ --startup-probe=tcpSocket.port=8000,initialDelaySeconds=240,timeoutSeconds=240,periodSeconds=240,failureThreshold=5 ``` This command deploys a Google Cloud Run service named `max-nvidia-full` using the `modular/max-nvidia-full` container image, allocating 32Gi of memory, 8 CPUs, and 1 NVIDIA L4 GPU in the configured region (`us-central1` in this example) with autoscaling between 1 and 5 instances. The `--concurrency=5` flag limits each instance to handling a maximum of 5 concurrent requests, triggering a new instance if the limit is exceeded. You can adjust the maximum concurrent requests to balance throughput, latency, and cost. Lower `--concurrency` values reduce latency but require more instances, while higher values increase per-instance throughput but may raise latency. For guidance on tuning cost and performance tradeoffs to your specific use-case, see [Throughput versus latency versus cost tradeoffs](https://cloud.google.com/run/docs/tips/general#throughput-latency-cost-tradeoff). The command allows unauthenticated access and configures a startup probe on port 8000 that allows extra time for the container to download and load the large language model. The model used here is the `modularai/Llama-3.1-8B-Instruct-GGUF` model. Once the deployment is complete, Cloud Run provides a service URL where you can send inference requests to the Llama 3.1 model. ## Test the deployment After deployment completes, you can test the OpenAI-compatible endpoint. 1. Get the Cloud Run service URL with the following command: ```bash SERVICE_URL=$(gcloud run services describe max-nvidia-full \ --region=${REGION} \ --format='value(status.url)') ``` 2. Send a chat completion inference request to the `max-nvidia-full` service.
```bash curl -N ${SERVICE_URL}/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "modularai/Llama-3.1-8B-Instruct-GGUF", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Why is the sky blue?"} ] }' | grep -o '"content":"[^"]*"' | sed 's/"content":"//g' | sed 's/"//g' | tr -d '\n' | sed 's/\\n/\n/g' ``` ## Metrics Retrieve metrics about your Cloud Run service with the following command: ```bash gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=max-nvidia-full" --limit 10 ``` You can also check the [Google Cloud Run console](https://console.cloud.google.com/run) for visualizations and detailed metrics about your `max-nvidia-full` service. For more information on metrics and telemetry specific to the MAX container, see [Metrics](/max/container/#metrics). :::note MAX container metrics are anonymous by default. To help our team analyze your deployment performance, you can add identifying environment variables. For more information see [Deployment and user ID](/max/container/#deployment-and-user-id). ::: ## Cost considerations When deploying applications on Google Cloud Run, understanding pricing factors can help you manage costs effectively. Cloud Run follows a pay-per-use model, meaning you only pay for the exact resources consumed during request execution. ### Pricing factors Cloud Run pricing is based on several key components: - **Request count**: You are billed per HTTP request processed by your service. - **Resource allocation**: The cost varies depending on the allocated CPU, memory, and (if applicable) GPU resources. - **Request duration**: You pay for the time each request takes to execute, measured in milliseconds. See [Cloud Run pricing](https://cloud.google.com/run/pricing) for more information on pricing details. ### Cost optimization strategies To minimize costs while maintaining performance, consider these optimization techniques: 1. **Right-size resources**: Start with minimal CPU and memory allocations during development and testing. Avoid over-provisioning unless necessary. 2. **Configure scaling wisely**: Set appropriate minimum and maximum instance limits to prevent unnecessary scaling and costs. 3. **Monitor cold starts**: If cold start latency affects performance, consider keeping a small number of instances always running, but balance this with cost trade-offs. 4. **Use spot instances**: For non-critical or batch workloads, spot instances can offer significant savings compared to standard pricing. ## Clean up After you're done testing your service, remove the deployment and free up resources with the following command: ```bash gcloud run services delete max-nvidia-full --region=${REGION} ``` ## Next steps MAX includes a benchmarking script that allows you to evaluate throughput, latency, and GPU utilization metrics. For more detailed instructions on benchmarking, please see [Benchmark MAX](https://github.com/modular/modular/tree/main/benchmark). To stay up to date with new releases, [sign up for our newsletter](https://www.modular.com/modverse#signup) and [join our community](https://www.modular.com/community). If you're interested in becoming a design partner to get early access and give us feedback, please [contact us](https://www.modular.com/company/contact). You can also explore other GPU deployment options with MAX. 
export const cards = [ { title: 'Deploy Llama 3 on GPU with MAX', link: '/max/tutorials/max-serve-local-to-cloud', description: `Learn how to deploy Llama 3 on GPU with MAX.`, }, { title: 'Deploy Llama 3 on GPU-powered Kubernetes clusters', link: '/max/tutorials/deploy-max-serve-on-kubernetes', description: `Learn how to deploy Llama 3 using Kubernetes, MAX, and NVIDIA GPUs`, }, ]; --- ## Serving import MDXListing from '@site/src/components/Listing/MDXListing'; import TutorialStack from '@site/src/components/TutorialStack'; Our high-performance serving library provides an OpenAI-compatible REST endpoint, enabling a smooth transition from OpenAI services or other libraries like vLLM and SGLang. MAX handles the complete request lifecycle with built-in support for function calling, structured output, and more, plus a Python API for offline inference. ## Guides export const docs = [ '../model-formats.mdx', '*' ] ## Tutorials export const tutorials = [ 'start-a-chat-endpoint', 'run-embeddings-with-max-serve', 'deploy-llama-vision', ]; --- ## set Implements the Set datatype. ## Structs * [​`Set`](/mojo/stdlib/collections/set/Set): A set data type. --- ## Set `struct Set[T: KeyElement]` A set data type. O(1) average-case amortized add, remove, and membership check. ```mojo from collections import Set var set = { 1, 2, 3 } print(len(set)) # 3 set.add(4) for element in set: print(element[]) set -= Set[Int](3, 4, 5) print(set == Set[Int](1, 2)) # True print(set | Set[Int](0, 1) == Set[Int](0, 1, 2)) # True var element = set.pop() print(len(set)) # 1 ``` ## Parameters * ​T (`KeyElement`): The element type of the set. Must implement KeyElement. ## Implemented traits `AnyType`, `Boolable`, `Comparable`, `Copyable`, `EqualityComparable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `KeyElement`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Sized`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, *ts: T, *, __set_literal__: Tuple[] = Tuple())` Construct a set from initial elements. **Args:** * ​\*ts (`T`): Variadic of elements to add to the set. * ​\_\_set\_literal\_\_ (`Tuple[]`): Tell Mojo to use this method for set literals. `@implicit` `__init__(out self, elements: List[T, hint_trivial_type])` Construct a set from a List of elements. **Args:** * ​elements (`List[T, hint_trivial_type]`): A vector of elements to add to the set. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy constructor. **Args:** * ​other (`Self`): The existing Set instance to copy from. ### `__bool__` `__bool__(self) -> Bool` Whether the set is non-empty or not. **Returns:** True if the set is non-empty, False if it is empty. ### `__lt__` `__lt__(self, other: Self) -> Bool` Overloads the < operator for strict subset comparison of sets. **Args:** * ​other (`Self`): The set to compare against for the strict subset relationship. **Returns:** True if the set is a strict subset of the `other` set, False otherwise. ### `__le__` `__le__(self, other: Self) -> Bool` Overloads the <= operator for sets. Works like the `issubset` method. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a subset of the `other` set, False otherwise. ### `__eq__` `__eq__(self, other: Self) -> Bool` Set equality. **Args:** * ​other (`Self`): Another Set instance to check equality against. **Returns:** True if the sets contain the same elements and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Set inequality. **Args:** * ​other (`Self`): Another Set instance to check equality against.
**Returns:** True if the sets are different and False otherwise. ### `__gt__` `__gt__(self, other: Self) -> Bool` Overloads the > operator for strict superset comparison of sets. **Args:** * ​other (`Self`): The set to compare against for the strict superset relationship. **Returns:** True if the set is a strict superset of the `other` set, False otherwise. ### `__ge__` `__ge__(self, other: Self) -> Bool` Overloads the >= operator for sets. Works like the `issuperset` method. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a superset of the `other` set, False otherwise. ### `__contains__` `__contains__(self, t: T) -> Bool` Whether or not the set contains an element. **Args:** * ​t (`T`): The element to check membership in the set. **Returns:** Whether or not the set contains the element. ### `__sub__` `__sub__(self, other: Self) -> Self` Set subtraction. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. **Returns:** A new set containing elements of this set, but not containing any elements which were in the `other` set. ### `__and__` `__and__(self, other: Self) -> Self` The set intersection operator. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. **Returns:** A new set containing only the elements which appear in both this set and the `other` set. ### `__or__` `__or__(self, other: Self) -> Self` The set union operator. **Args:** * ​other (`Self`): Another Set instance to union with this one. **Returns:** A new set containing any elements which appear in either this set or the `other` set. ### `__xor__` `__xor__(self, other: Self) -> Self` Overloads the ^ operator for sets. Works like the `symmetric_difference` method. **Args:** * ​other (`Self`): The set to find the symmetric difference with. **Returns:** A new set containing the symmetric difference of the two sets. ### `__isub__` `__isub__(mut self, other: Self)` In-place set subtraction. Updates the set to remove any elements from the `other` set. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. ### `__iand__` `__iand__(mut self, other: Self)` In-place set intersection. Updates the set to contain only the elements which are already in the set and are also contained in the `other` set. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. ### `__ixor__` `__ixor__(mut self, other: Self)` Overloads the ^= operator. Works like the `symmetric_difference_update` method. Updates the set with the symmetric difference of itself and another set. **Args:** * ​other (`Self`): The set to find the symmetric difference with. ### `__ior__` `__ior__(mut self, other: Self)` In-place set union. Updates the set to contain all elements in the `other` set as well as keeping all elements it already contained. **Args:** * ​other (`Self`): Another Set instance to union with this one. ### `__len__` `__len__(self) -> Int` The size of the set. **Returns:** The number of elements in the set. ### `__hash__` `__hash__(self) -> UInt` A hash value of the elements in the set. The hash value is order independent, so s1 == s2 -> hash(s1) == hash(s2). **Returns:** A hash value of the set suitable for non-cryptographic purposes. ### `__str__` `__str__[U: KeyElement & Representable, //](self: Set[U]) -> String` Returns the string representation of the set. **Parameters:** * ​U (`KeyElement & Representable`): The type of the elements in the set. Must implement the `Representable` and `KeyElement` traits.
**Returns:** The string representation of the set. ### `__repr__` `__repr__[U: KeyElement & Representable, //](self: Set[U]) -> String` Returns the string representation of the set. **Parameters:** * ​U (`KeyElement & Representable`): The type of the elements in the set. Must implement the `Representable` and `KeyElement` traits. **Returns:** The string representation of the set. ### `write_to` `write_to[W: Writer, U: KeyElement & Representable, //](self: Set[U], mut writer: W)` Write Set string representation to a `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. * ​U (`KeyElement & Representable`): The type of the elements in the set. Must implement the `Representable` and `KeyElement` traits. **Args:** * ​writer (`W`): The object to write to. ### `__iter__` `__iter__(ref self) -> _DictKeyIter[T, NoneType, self_is_origin._data]` Iterate over elements of the set, returning immutable references. **Returns:** An iterator of immutable references to the set elements. ### `add` `add(mut self, t: T)` Add an element to the set. **Args:** * ​t (`T`): The element to add to the set. ### `remove` `remove(mut self, t: T)` Remove an element from the set. **Args:** * ​t (`T`): The element to remove from the set. **Raises:** If the element isn't in the set to remove. ### `pop` `pop(mut self) -> T` Remove any one item from the set, and return it. As an implementation detail, this will remove the first item according to insertion order. This is practically useful for breadth-first search implementations. **Returns:** The element which was removed from the set. **Raises:** If the set is empty. ### `union` `union(self, other: Self) -> Self` Set union. **Args:** * ​other (`Self`): Another Set instance to union with this one. **Returns:** A new set containing any elements which appear in either this set or the `other` set. ### `intersection` `intersection(self, other: Self) -> Self` Set intersection. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. **Returns:** A new set containing only the elements which appear in both this set and the `other` set. ### `difference` `difference(self, other: Self) -> Self` Set difference. **Args:** * ​other (`Self`): Another Set instance to find the difference with this one. **Returns:** A new set containing elements that are in this set but not in the `other` set. ### `update` `update(mut self, other: Self)` In-place set update. Updates the set to contain all elements in the `other` set as well as keeping all elements it already contained. **Args:** * ​other (`Self`): Another Set instance to union with this one. ### `intersection_update` `intersection_update(mut self, other: Self)` In-place set intersection update. Updates the set by retaining only elements found in both this set and the `other` set, removing all other elements. The result is the intersection of this set with `other`. **Args:** * ​other (`Self`): Another Set instance to intersect with this one. ### `difference_update` `difference_update(mut self, other: Self)` In-place set subtraction. Updates the set by removing all elements found in the `other` set, effectively keeping only elements that are unique to this set. **Args:** * ​other (`Self`): Another Set instance to subtract from this one. ### `issubset` `issubset(self, other: Self) -> Bool` Check if this set is a subset of another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a subset of the `other` set, False otherwise.
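To make the relationship between the named methods and the comparison operators concrete, here is a minimal sketch (using the `collections.Set` import shown in the example above; the printed values follow directly from the subset definitions):

```mojo
from collections import Set

fn main():
    var small = Set[Int](1, 2)
    var big = Set[Int](1, 2, 3)
    print(small.issubset(big))    # True, same as: small <= big
    print(big.issuperset(small))  # True, same as: big >= small
    print(small < big)            # True: a strict subset is a subset that is not equal
```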
### `isdisjoint` `isdisjoint(self, other: Self) -> Bool` Check if this set is disjoint with another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is disjoint with the `other` set, False otherwise. ### `issuperset` `issuperset(self, other: Self) -> Bool` Check if this set is a superset of another set. **Args:** * ​other (`Self`): Another Set instance to check against. **Returns:** True if this set is a superset of the `other` set, False otherwise. ### `symmetric_difference` `symmetric_difference(self, other: Self) -> Self` Returns the symmetric difference of two sets. **Args:** * ​other (`Self`): The set to find the symmetric difference with. **Returns:** A new set containing the symmetric difference of the two sets. ### `symmetric_difference_update` `symmetric_difference_update(mut self, other: Self)` Updates the set with the symmetric difference of itself and another set. **Args:** * ​other (`Self`): The set to find the symmetric difference with. ### `discard` `discard(mut self, value: T)` Remove a value from the set if it exists. Do nothing otherwise. **Args:** * ​value (`T`): The element to remove from the set. ### `clear` `clear(mut self)` Removes all elements from the set. This method modifies the set in-place, removing all of its elements. After calling this method, the set will be empty. --- ## setenv `setenv(owned name: String, owned value: String, overwrite: Bool = True) -> Bool` Changes or adds an environment variable. **Constraints:** The function only works on macOS or Linux and returns False otherwise. **Args:** * ​name (`String`): The name of the environment variable. * ​value (`String`): The value of the environment variable. * ​overwrite (`Bool`): If an environment variable with the given name already exists, its value is not changed unless `overwrite` is True. **Returns:** False if the name is empty or contains an `=` character. In any other case, True is returned. --- ## shallow_apply `shallow_apply[func: fn[ImmutableOrigin](IntTuple[$0]) -> Int](t: IntTuple[origin]) -> IntTuple` Apply a function to each top-level element of an `IntTuple`. Unlike `apply()`, this function only operates on the immediate children of the input tuple without recursing into nested tuples. **Parameters:** * ​func (`fn[ImmutableOrigin](IntTuple[$0]) -> Int`): Function that takes an `IntTuple` and returns an `Int`. **Args:** * ​t (`IntTuple[origin]`): The `IntTuple` whose elements will be transformed. **Returns:** A new `IntTuple` with the function applied to each top-level element. --- ## shape_div `shape_div(a: IntTuple[origin], b: IntTuple[origin]) -> IntTuple` Performs a division operation between shape tuples. Handles four cases: 1. tuple-tuple: Performs shape\_div element-wise when dimensions match 2. tuple-int: Folds the division of b across each element of a Example: `shape_div((4,5,6),40)` -> `shape_div((1,5,6),10)` -> `shape_div((1,1,6),2)` -> `(1,1,3)` 3. int-tuple: Returns `shape_div(a, product(b))` 4. int-int: Enforces the divisibility condition `a % b == 0 || b % a == 0` when possible. Returns `a / b` with rounding away from `0` (that is, `1` or `-1` when `a < b`). **Args:** * ​a (`IntTuple[origin]`): The dividend `IntTuple`. * ​b (`IntTuple[origin]`): The divisor `IntTuple`.
**Returns:** A new `IntTuple` containing the result of the division operation. --- ## shape_div `shape_div[: ImmutableOrigin, : ImmutableOrigin, //, a_t: IntTuple[$1], b_t: IntTuple[$0]](a: RuntimeTuple[a_t, element_type=element_type], b: RuntimeTuple[b_t, element_type=element_type]) -> RuntimeTuple[shape_div[::Origin[::Bool(a_t, b_t)]` Performs specialized shape division between `RuntimeTuple`s. This function implements a special division operation specifically designed for tensor shape calculations. Unlike standard division, it handles special cases: 1. If shapes are directly divisible (a % b == 0), returns a standard division (a // b) 2. If shapes are inversely divisible (b % a == 0), returns the signed reciprocal 3. If shapes are incompatible, aborts with an error This operation is essential for transformations between tensor layouts and computing broadcasting semantics. **Parameters:** * ​a\_t (`IntTuple[$1]`): Type of the first operand. * ​b\_t (`IntTuple[$0]`): Type of the second operand. **Args:** * ​a (`RuntimeTuple[a_t, element_type=element_type]`): The dividend `RuntimeTuple`. * ​b (`RuntimeTuple[b_t, element_type=element_type]`): The divisor `RuntimeTuple`. **Returns:** A new `RuntimeTuple` containing the result of the shape division. --- ## shapes ## Functions * [​`get_sliding_window_out_dim`](./get_sliding_window_out_dim): Return output dimension for a sliding window operation along some dimension. --- ## SharedMemBarrier `@register_passable(trivial)` `struct SharedMemBarrier` A hardware-accelerated synchronization primitive for GPU shared memory operations. This struct provides a barrier mechanism optimized for coordinating thread execution and memory transfers in GPU kernels, particularly for Tensor Memory Accelerator (TMA) operations. It enables efficient synchronization between threads and memory operations by leveraging hardware-specific barrier instructions. Key features: * Thread synchronization across thread blocks * Memory transfer completion tracking * Hardware-accelerated barrier operations * Support for phased synchronization This barrier is particularly useful for ensuring that shared memory operations complete before dependent computations begin, which is critical for maintaining data consistency in high-performance GPU kernels. ## Fields * ​mbar (`SIMD[int64, 1]`): Shared memory location used for the barrier state. This field stores an 8-byte aligned shared memory location that maintains the state of the barrier. The memory must be in shared address space to be accessible by all threads in a block. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `init` `init(ref [3] self, num_threads: SIMD[int32, 1] = __init__[__mlir_type.!pop.int_literal](1))` Initialize the barrier state with the expected number of threads. Sets up the barrier to expect arrivals from the specified number of threads before it can be satisfied. This is essential for coordinating thread synchronization in GPU kernels. **Args:** * ​num\_threads (`SIMD[int32, 1]`): Number of threads that must arrive at the barrier before it is satisfied. Defaults to 1. ### `expect_bytes` `expect_bytes(ref [3] self, bytes: SIMD[int32, 1])` Configure the barrier to expect a specific number of bytes to be transferred. Used with TMA operations to indicate the expected size of data transfer. The barrier will be satisfied when the specified number of bytes has been transferred, enabling efficient coordination of memory operations.
**Args:** * ​bytes (`SIMD[int32, 1]`): Number of bytes expected to be transferred. ### `wait` `wait(ref [3] self, phase: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](0))` Wait until the barrier is satisfied. Blocks the calling thread until the barrier is satisfied, either by the expected number of threads arriving or the expected data transfer completing. This method implements an efficient spin-wait mechanism optimized for GPU execution. Note: Minimizes thread divergence during synchronization by using hardware-accelerated barrier instructions. **Args:** * ​phase (`SIMD[uint32, 1]`): The phase value to check against. Defaults to 0. ### `unsafe_ptr` `unsafe_ptr(ref [3] self) -> UnsafePointer[SIMD[int64, 1], address_space=AddressSpace(3), alignment=8, mut=self_is_mut, origin=self_is_origin]` Get an unsafe pointer to the barrier's memory location. Provides low-level access to the shared memory location storing the barrier state. This method is primarily used internally by other barrier operations that need direct access to the underlying memory. **Returns:** An unsafe pointer to the barrier's memory location in shared memory, properly typed and aligned for barrier operations. ### `arrive_cluster` `arrive_cluster(ref [3] self, cta_id: SIMD[uint32, 1], count: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1))` Signal arrival at the barrier from a specific CTA (Cooperative Thread Array) in a cluster. This method is used in multi-CTA scenarios to coordinate barrier arrivals across different CTAs within a cluster. It enables efficient synchronization across thread blocks in clustered execution models. **Args:** * ​cta\_id (`SIMD[uint32, 1]`): The ID of the CTA (Cooperative Thread Array) that is arriving. * ​count (`SIMD[uint32, 1]`): The number of arrivals to signal. Defaults to 1. ### `arrive` `arrive(ref [3] self) -> Int` Signal arrival at the barrier and return the arrival count. This method increments the arrival count at the barrier and returns the updated count. It's used to track how many threads have reached the synchronization point. **Returns:** The updated arrival count after this thread's arrival. --- ## shiftl `shiftl(a: Int, s: Int) -> Int` Shift left or right based on sign of shift amount. Performs a left shift if `s` is positive, or a right shift if `s` is negative. **Args:** * ​a (`Int`): The integer value to shift. * ​s (`Int`): The shift amount. Positive for left, negative for right. **Returns:** The shifted integer value. `shiftl(a: SIMD[dtype, 1], s: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Shift left/right based on sign of shift for scalars. Scalar version of `shiftl`. Left shift if `s` is positive, right shift if `s` is negative. **Args:** * ​a (`SIMD[dtype, 1]`): The scalar value to shift. * ​s (`SIMD[dtype, 1]`): The scalar shift amount. Positive for left, negative for right. **Returns:** The shifted scalar value. --- ## shiftr `shiftr(a: Int, s: Int) -> Int` Shift right or left based on sign of shift amount. Performs a right shift if `s` is positive, or a left shift if `s` is negative. **Args:** * ​a (`Int`): The integer value to shift. * ​s (`Int`): The shift amount. Positive for right, negative for left. **Returns:** The shifted integer value. `shiftr(a: SIMD[dtype, 1], s: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Shift right/left based on sign of shift for scalars. Scalar version of `shiftr`. Right shift if `s` is positive, left shift if `s` is negative. **Args:** * ​a (`SIMD[dtype, 1]`): The scalar value to shift.
* ​s (`SIMD[dtype, 1]`): The scalar shift amount. Positive for right, negative for left. **Returns:** The shifted scalar value. --- ## shuffle `shuffle[T: Copyable & Movable, //](mut list: List[T])` Shuffles the elements of the list randomly. Performs an in-place Fisher-Yates shuffle on the provided list. **Parameters:** * ​T (`Copyable & Movable`): The type of element in the List. **Args:** * ​list (`List[T]`): The list to modify. --- ## shuffle_down `shuffle_down[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with higher lane IDs in the warp. Performs a shuffle operation where each thread receives a value from a thread with a higher lane ID, offset by the specified amount. Uses the full warp mask by default. For example, with offset=1: * Thread 0 gets value from thread 1 * Thread 1 gets value from thread 2 * Thread N gets value from thread N+1 * Last N threads get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled down the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values down by. Must be positive. **Returns:** The SIMD value from the thread offset lanes higher in the warp. Returns undefined values for threads where lane\_id + offset >= WARP\_SIZE. `shuffle_down[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with higher lane IDs in the warp using a custom mask. Performs a shuffle operation where each thread receives a value from a thread with a higher lane ID, offset by the specified amount. The mask parameter controls which threads participate in the shuffle. For example, with offset=1: * Thread 0 gets value from thread 1 * Thread 1 gets value from thread 2 * Thread N gets value from thread N+1 * Last N threads get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bitmask controlling which threads participate in the shuffle. Only threads with their corresponding bit set will exchange values. * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled down the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values down by. Must be positive. **Returns:** The SIMD value from the thread offset lanes higher in the warp. Returns undefined values for threads where lane\_id + offset >= WARP\_SIZE or where the corresponding mask bit is not set. --- ## shuffle_idx `shuffle_idx[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies a value from a source lane to other lanes in a warp. Broadcasts a value from a source thread in a warp to all participating threads without using shared memory. This is a convenience wrapper that uses the full warp mask by default. Example: ```mojo from gpu.warp import shuffle_idx var val = SIMD[DType.float32, 16](1.0) # Broadcast value from lane 0 to all lanes var result = shuffle_idx(val, 0) # Get value from lane 5 result = shuffle_idx(val, 5) ``` **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32, half).
* ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be broadcast from the source lane. * ​offset (`SIMD[uint32, 1]`): The source lane ID to copy the value from. **Returns:** A SIMD vector where all lanes contain the value from the source lane specified by offset. `shuffle_idx[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies a value from a source lane to other lanes in a warp with explicit mask control. Broadcasts a value from a source thread in a warp to participating threads specified by the mask. This provides fine-grained control over which threads participate in the shuffle operation. Example: ```mojo from gpu.warp import shuffle_idx # Only broadcast to first 16 lanes var mask = 0xFFFF # 16 ones var val = SIMD[DType.float32, 32](1.0) var result = shuffle_idx(mask, val, 5) ``` **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32, half). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bit mask specifying which lanes participate in the shuffle (1 bit per lane). * ​val (`SIMD[type, simd_width]`): The SIMD value to be broadcast from the source lane. * ​offset (`SIMD[uint32, 1]`): The source lane ID to copy the value from. **Returns:** A SIMD vector where participating lanes (set in mask) contain the value from the source lane specified by offset. Non-participating lanes retain their original values. --- ## shuffle_up `shuffle_up[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with lower lane IDs in the warp. Performs a shuffle operation where each thread receives a value from a thread with a lower lane ID, offset by the specified amount. Uses the full warp mask by default. For example, with offset=1: * Thread N gets value from thread N-1 * Thread 1 gets value from thread 0 * Thread 0 gets undefined value **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled up the warp. * ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values up by. **Returns:** The SIMD value from the thread offset lanes lower in the warp. Returns undefined values for threads where lane\_id - offset < 0. `shuffle_up[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Copies values from threads with lower lane IDs in the warp. Performs a shuffle operation where each thread receives a value from a thread with a lower lane ID, offset by the specified amount. The operation is performed only for threads specified in the mask. For example, with offset=1: * Thread N gets value from thread N-1 if both threads are in the mask * Thread 1 gets value from thread 0 if both threads are in the mask * Thread 0 gets undefined value * Threads not in the mask get undefined values **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): The warp mask specifying which threads participate in the shuffle. * ​val (`SIMD[type, simd_width]`): The SIMD value to be shuffled up the warp.
* ​offset (`SIMD[uint32, 1]`): The number of lanes to shift values up by. **Returns:** The SIMD value from the thread offset lanes lower in the warp. Returns undefined values for threads where lane\_id - offset < 0. --- ## shuffle_xor `shuffle_xor[type: DType, simd_width: Int, //](val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Exchanges values between threads in a warp using a butterfly pattern. Performs a butterfly exchange pattern where each thread swaps values with another thread whose lane ID differs by a bitwise XOR with the given offset. This creates a butterfly communication pattern useful for parallel reductions and scans. **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​val (`SIMD[type, simd_width]`): The SIMD value to be exchanged with another thread. * ​offset (`SIMD[uint32, 1]`): The lane offset to XOR with the current thread's lane ID to determine the exchange partner. Common values are powers of 2 for butterfly patterns. **Returns:** The SIMD value from the thread at lane (current\_lane XOR offset). `shuffle_xor[type: DType, simd_width: Int, //](mask: UInt, val: SIMD[type, simd_width], offset: SIMD[uint32, 1]) -> SIMD[type, simd_width]` Exchanges values between threads in a warp using a butterfly pattern with masking. Performs a butterfly exchange pattern where each thread swaps values with another thread whose lane ID differs by a bitwise XOR with the given offset. The mask parameter allows controlling which threads participate in the exchange. Example: ```mojo from gpu.warp import shuffle_xor # Exchange values between even-numbered threads 4 lanes apart var mask: UInt = 0x55555555 # Even-numbered lanes only var val = SIMD[DType.float32, 16](42.0) # Example value var result = shuffle_xor(mask, val, 4) ``` **Parameters:** * ​type (`DType`): The data type of the SIMD elements (e.g. float32, int32). * ​simd\_width (`Int`): The number of elements in each SIMD vector. **Args:** * ​mask (`UInt`): A bit mask specifying which threads participate in the exchange. Only threads with their corresponding bit set in the mask will exchange values. * ​val (`SIMD[type, simd_width]`): The SIMD value to be exchanged with another thread. * ​offset (`SIMD[uint32, 1]`): The lane offset to XOR with the current thread's lane ID to determine the exchange partner. Common values are powers of 2 for butterfly patterns. **Returns:** The SIMD value from the thread at lane (current\_lane XOR offset) if both threads are enabled by the mask, otherwise the original value is preserved. --- ## sign `sign[type: DType, simd_width: Int](x: SIMD[type, simd_width]) -> SIMD[type, simd_width]` Compute the sign (-1, 0, or 1) of the input value. **Parameters:** * ​type (`DType`): DType used for the computation. * ​simd\_width (`Int`): SIMD width used for the computation. **Args:** * ​x (`SIMD[type, simd_width]`): The value to compute the sign operation on. **Returns:** The result of the sign operation. --- ## Signal `@register_passable(trivial)` `struct Signal` A synchronization primitive for coordinating GPU thread blocks across multiple devices. This struct provides counter-based synchronization between thread blocks on different GPUs. It maintains two sets of counters: 1. self\_counter: Used by blocks on the current GPU to signal their progress 2.
peer\_counter: Used to track progress of blocks on other GPUs Note: The counters use unsigned integers that may overflow, but this is safe since unsigned integer overflow has well-defined behavior. ## Fields * ​self\_counter (`StaticTuple[StaticTuple[SIMD[uint32, 1], 8], 512]`): A 2D array of counters with shape (MAX\_NUM\_BLOCKS\_UPPER\_BOUND, MAX\_GPUS). Each counter tracks the progress of a specific thread block on the current GPU. Thread blocks increment their corresponding counter to signal completion of a phase, allowing other GPUs to detect when synchronization points are reached. The counters use atomic operations to ensure proper synchronization across devices. * ​peer\_counter (`StaticTuple[StaticTuple[StaticTuple[SIMD[uint32, 1], 8], 512], 2]`): A 3D array of counters with shape (2, MAX\_NUM\_BLOCKS\_UPPER\_BOUND, MAX\_GPUS). Contains two sets of counters to handle two synchronization points safely. The dual counter design prevents race conditions where a peer block arrives at the second sync point before the current block passes the first sync point. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` --- ## signum `signum(a: Int) -> Int` Calculate the sign of an integer. This function determines the sign of the input integer and returns a corresponding indicator value. Example: ```mojo from layout.int_tuple import signum var result1 = signum(5) # Returns 1 var result2 = signum(-10) # Returns -1 var result3 = signum(0) # Returns 0 ``` **Args:** * ​a (`Int`): The integer value to determine the sign of. **Returns:** 1 if `a` > 0, -1 if `a` < 0, and 0 if `a` == 0. --- ## signum `signum(a: Int) -> Int` Returns the sign of an integer value. This helper function determines whether a number is positive, negative, or zero, returning 1 for positive, -1 for negative, and 0 for zero. **Args:** * ​a (`Int`): The integer value to determine the sign of. **Returns:** 1 if a > 0, -1 if a < 0, and 0 if a == 0. --- ## simd Implements SIMD primitives and abstractions. Provides high-performance SIMD primitives and abstractions for vectorized computation in Mojo. It enables efficient data-parallel operations by leveraging hardware vector processing units across different architectures. Key Features: 1. Architecture-agnostic SIMD abstractions with automatic hardware detection 2. Optimized vector operations for common numerical computations 3. Explicit control over vectorization strategies and memory layouts 4. Zero-cost abstractions that compile to efficient machine code 5. Support for different vector widths and element types Primary Components: * Vector types: Strongly-typed vector containers with element-wise operations * SIMD intrinsics: Low-level access to hardware SIMD instructions * Vectorized algorithms: Common algorithms optimized for SIMD execution * Memory utilities: Aligned memory allocation and vector load/store operations Performance Considerations: * Vector width selection should match target hardware capabilities * Memory alignment affects load/store performance * Data layout transformations may be necessary for optimal vectorization Integration: This module is designed to work seamlessly with other Mojo numerical computing components, including tensor operations, linear algebra routines, and domain-specific libraries for machine learning and scientific computing. ## Aliases ### `BFloat16` `alias BFloat16 = SIMD[bfloat16, 1]` Represents a 16-bit brain floating point value. ### `Byte` `alias Byte = SIMD[uint8, 1]` Represents a byte (backed by an 8-bit unsigned integer).
### `Float16` `alias Float16 = SIMD[float16, 1]` Represents a 16-bit floating point value. ### `Float32` `alias Float32 = SIMD[float32, 1]` Represents a 32-bit floating point value. ### `Float64` `alias Float64 = SIMD[float64, 1]` Represents a 64-bit floating point value. ### `Float8_e4m3fn` `alias Float8_e4m3fn = SIMD[float8_e4m3fn, 1]` Represents the E4M3 floating point format defined in the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1). This type is named differently across libraries and vendors, for example: * Mojo, PyTorch, JAX, and LLVM refer to it as `e4m3fn`. * OCP, NVIDIA CUDA, and AMD ROCm refer to it as `e4m3`. In these contexts, they are all referring to the same finite type specified in the OFP8 standard above, encoded as `seeeemmm`: * (s)ign: 1 bit * (e)xponent: 4 bits * (m)antissa: 3 bits * exponent bias: 7 * nan: 01111111, 11111111 * -0: 10000000 * fn: finite (no inf or -inf encodings) ### `Float8_e4m3fnuz` `alias Float8_e4m3fnuz = SIMD[float8_e4m3fnuz, 1]` Represents an 8-bit e4m3fnuz floating point format, encoded as `seeeemmm`: - (s)ign: 1 bit - (e)xponent: 4 bits - (m)antissa: 3 bits - exponent bias: 8 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `Float8_e5m2` `alias Float8_e5m2 = SIMD[float8_e5m2, 1]` Represents the 8-bit E5M2 floating point format from the [OFP8 standard](https://www.opencompute.org/documents/ocp-8-bit-floating-point-specification-ofp8-revision-1-0-2023-12-01-pdf-1), encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 15 - nan: {0,1}11111{01,10,11} - inf: 01111100 - -inf: 11111100 - -0: 10000000 ### `Float8_e5m2fnuz` `alias Float8_e5m2fnuz = SIMD[float8_e5m2fnuz, 1]` Represents an 8-bit floating point format, encoded as `seeeeemm`: - (s)ign: 1 bit - (e)xponent: 5 bits - (m)antissa: 2 bits - exponent bias: 16 - nan: 10000000 - fn: finite (no inf or -inf encodings) - uz: unsigned zero (no -0 encoding) ### `Int128` `alias Int128 = SIMD[si128, 1]` Represents a 128-bit signed scalar integer. ### `Int16` `alias Int16 = SIMD[int16, 1]` Represents a 16-bit signed scalar integer. ### `Int256` `alias Int256 = SIMD[si256, 1]` Represents a 256-bit signed scalar integer. ### `Int32` `alias Int32 = SIMD[int32, 1]` Represents a 32-bit signed scalar integer. ### `Int64` `alias Int64 = SIMD[int64, 1]` Represents a 64-bit signed scalar integer. ### `Int8` `alias Int8 = SIMD[int8, 1]` Represents an 8-bit signed scalar integer. ### `Scalar` `alias Scalar = SIMD[?, 1]` Represents a scalar dtype. ### `UInt128` `alias UInt128 = SIMD[ui128, 1]` Represents a 128-bit unsigned scalar integer. ### `UInt16` `alias UInt16 = SIMD[uint16, 1]` Represents a 16-bit unsigned scalar integer. ### `UInt256` `alias UInt256 = SIMD[ui256, 1]` Represents a 256-bit unsigned scalar integer. ### `UInt32` `alias UInt32 = SIMD[uint32, 1]` Represents a 32-bit unsigned scalar integer. ### `UInt64` `alias UInt64 = SIMD[uint64, 1]` Represents a 64-bit unsigned scalar integer. ### `UInt8` `alias UInt8 = SIMD[uint8, 1]` Represents an 8-bit unsigned scalar integer. ## Structs * [​`SIMD`](/mojo/stdlib/builtin/simd/SIMD): Represents a small vector that is backed by a hardware vector element. --- ## SIMD `@register_passable(trivial)` `struct SIMD[dtype: DType, size: Int]` Represents a small vector that is backed by a hardware vector element. 
SIMD allows a single instruction to be executed across the multiple data elements of the vector. **Constraints:** The size of the SIMD vector must be positive and a power of 2. ## Parameters * ​dtype (`DType`): The data type of SIMD vector elements. * ​size (`Int`): The size of the SIMD vector. ## Fields * ​value (`!pop.simd`): The underlying storage for the vector. ## Implemented traits `Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Ceilable`, `Copyable`, `DevicePassable`, `ExplicitlyCopyable`, `Floatable`, `Floorable`, `Hashable`, `Indexer`, `Intable`, `Movable`, `PythonConvertible`, `Representable`, `Roundable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `device_type` `alias device_type = SIMD[dtype, size]` SIMD types are remapped to the same type when passed to accelerator devices. ### `element_type` `alias element_type = dtype` The element data type of the SIMD vector. ### `MAX` `alias MAX = SIMD(max_or_inf[::DType]())` Gets the maximum value for the SIMD value, potentially +inf. ### `MAX_FINITE` `alias MAX_FINITE = SIMD(max_finite[::DType]())` Returns the maximum finite value of the SIMD value. ### `MIN` `alias MIN = SIMD(min_or_neg_inf[::DType]())` Gets the minimum value for the SIMD value, potentially -inf. ### `MIN_FINITE` `alias MIN_FINITE = SIMD(min_finite[::DType]())` Returns the minimum (lowest) finite value of the SIMD value. ## Methods ### `__init__` `__init__() -> Self` Default initializer of the SIMD vector. By default the SIMD vectors are initialized to all zeros. `__init__[other_dtype: DType, //](value: SIMD[other_dtype, size], /) -> Self` Initialize from another SIMD of the same size. If the value passed is a scalar, you can initialize a SIMD vector with more elements. Example: ```mojo print(UInt64(UInt8(42))) # 42 print(SIMD[DType.uint64, 4](UInt8(42))) # [42, 42, 42, 42] ``` Casting behavior: ```mojo # Basic casting preserves value within range Int8(UInt8(127)) == Int8(127) # Numbers above signed max wrap to negative using two's complement Int8(UInt8(128)) == Int8(-128) Int8(UInt8(129)) == Int8(-127) Int8(UInt8(256)) == Int8(0) # Negative signed cast to unsigned using two's complement UInt8(Int8(-128)) == UInt8(128) UInt8(Int8(-127)) == UInt8(129) UInt8(Int8(-1)) == UInt8(255) # Truncate precision after downcast and upcast Float64(Float32(Float64(123456789.123456789))) == Float64(123456792.0) # Rightmost bits of significand become 0's on upcast Float64(Float32(0.3)) == Float64(0.30000001192092896) # Numbers equal after truncation of float literal and cast truncation Float32(Float64(123456789.123456789)) == Float32(123456789.123456789) # Float to int/uint floors Int64(Float64(42.2)) == Int64(42) ``` **Parameters:** * ​other\_dtype (`DType`): The type of the value that is being cast from. **Args:** * ​value (`SIMD[other_dtype, size]`): The value to cast from. `@implicit` `__init__(value: UInt, /) -> Self` Initializes the SIMD vector with an unsigned integer. The unsigned integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`UInt`): The input value. `@implicit` `__init__(value: Int, /) -> Self` Initializes the SIMD vector with a signed integer. The signed integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`Int`): The input value. `__init__[T: Floatable, //](value: T, /) -> SIMD[float64, 1]` Initialize a Float64 from a type conforming to Floatable. **Parameters:** * ​T (`Floatable`): The Floatable type.
**Args:** * ​value (`T`): The object to get the floating point representation of. `__init__[T: FloatableRaising, //](out self: SIMD[float64, 1], value: T, /)` Initialize a Float64 from a type conforming to FloatableRaising. **Parameters:** * ​T (`FloatableRaising`): The FloatableRaising type. **Args:** * ​value (`T`): The object to get the floating point representation of. **Raises:** If the type does not have a floating point representation. `__init__[*, _: Int = 0](out self: SIMD[float64, 1], value: PythonObject, /)` Initialize a Float64 from a PythonObject. **Parameters:** * ​\_ (`Int`): A dummy parameter to ensure this overload has lower priority than the others. Its value is ignored. **Args:** * ​value (`PythonObject`): The PythonObject to convert. **Raises:** If the conversion to double fails. `@implicit` `__init__(value: IntLiteral[value], /) -> Self` Initializes the SIMD vector with an integer. The integer value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`IntLiteral[value]`): The input value. `@implicit` `__init__(value: Bool, /) -> SIMD[bool, size]` Initializes the SIMD vector with a bool value. The bool value is splatted across all elements of the SIMD vector. **Args:** * ​value (`Bool`): The bool value. `@implicit` `__init__(value: !pop.simd, /) -> Self` Initializes the SIMD vector with the underlying mlir value. **Args:** * ​value (`!pop.simd`): The input value. `@implicit` `__init__(value: SIMD[dtype, 1], /) -> Self` Constructs a SIMD vector by splatting a scalar value. The input value is splatted across all elements of the SIMD vector. **Args:** * ​value (`SIMD[dtype, 1]`): The value to splat to the elements of the vector. `__init__(*elems: SIMD[dtype, 1]) -> Self` Constructs a SIMD vector via a variadic list of elements. The input values are assigned to the corresponding elements of the SIMD vector. **Constraints:** The number of input values is equal to the size of the SIMD vector. **Args:** * ​\*elems (`SIMD[dtype, 1]`): The variadic list of elements from which the SIMD vector is constructed. `@implicit` `__init__(value: FloatLiteral[value], /) -> Self` Initializes the SIMD vector with a float. The value is splatted across all the elements of the SIMD vector. **Args:** * ​value (`FloatLiteral[value]`): The input value. ### `__bool__` `__bool__(self) -> Bool` Converts the SIMD scalar into a boolean value. **Constraints:** The size of the SIMD vector must be 1. **Returns:** True if the SIMD scalar is non-zero and False otherwise. ### `__getitem__` `__getitem__(self, idx: Int) -> SIMD[dtype, 1]` Gets an element from the vector. **Args:** * ​idx (`Int`): The element index. **Returns:** The value at position `idx`. ### `__setitem__` `__setitem__(mut self, idx: Int, val: SIMD[dtype, 1])` Sets an element in the vector. **Args:** * ​idx (`Int`): The index to set. * ​val (`SIMD[dtype, 1]`): The value to set. ### `__neg__` `__neg__(self) -> Self` Defines the unary `-` operation. **Returns:** The negation of this SIMD vector. ### `__pos__` `__pos__(self) -> Self` Defines the unary `+` operation. **Returns:** This SIMD vector. ### `__invert__` `__invert__(self) -> Self` Returns `~self`. **Constraints:** The element type of the SIMD vector must be boolean or integral. **Returns:** The `~self` value. ### `__lt__` `__lt__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using less-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation.
**Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] < rhs[i]`. ### `__le__` `__le__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using less-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] <= rhs[i]`. ### `__eq__` `__eq__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using equal-to comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] == rhs[i]`. ### `__ne__` `__ne__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using not-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] != rhs[i]`. ### `__gt__` `__gt__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using greater-than comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] > rhs[i]`. ### `__ge__` `__ge__(self, rhs: Self) -> SIMD[bool, size]` Compares two SIMD vectors using greater-than-or-equal comparison. **Args:** * ​rhs (`Self`): The rhs of the operation. **Returns:** A new bool SIMD vector of the same size whose element at position `i` is True or False depending on the expression `self[i] >= rhs[i]`. ### `__contains__` `__contains__(self, value: SIMD[dtype, 1]) -> Bool` Whether the vector contains the value. **Args:** * ​value (`SIMD[dtype, 1]`): The value. **Returns:** Whether the vector contains the value. ### `__add__` `__add__(self, rhs: Self) -> Self` Computes `self + rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] + rhs[i]`. ### `__sub__` `__sub__(self, rhs: Self) -> Self` Computes `self - rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] - rhs[i]`. ### `__mul__` `__mul__(self, rhs: Self) -> Self` Computes `self * rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] * rhs[i]`. ### `__truediv__` `__truediv__(self, rhs: Self) -> Self` Computes `self / rhs`. **Args:** * ​rhs (`Self`): The rhs value. **Returns:** A new vector whose element at position `i` is computed as `self[i] / rhs[i]`. ### `__floordiv__` `__floordiv__(self, rhs: Self) -> Self` Returns the division of self and rhs rounded down to the nearest integer. **Constraints:** The element type of the SIMD vector must be numeric. **Args:** * ​rhs (`Self`): The value to divide with. **Returns:** `floor(self / rhs)` value. ### `__mod__` `__mod__(self, rhs: Self) -> Self` Returns the remainder of self divided by rhs. **Args:** * ​rhs (`Self`): The value to divide on. **Returns:** The remainder of dividing self by rhs. ### `__pow__` `__pow__(self, exp: Int) -> Self` Computes the vector raised to the power of the input integer value. **Args:** * ​exp (`Int`): The exponent value. **Returns:** A SIMD vector where each element is raised to the power of the specified exponent value.
`__pow__(self, exp: Self) -> Self` Computes the vector raised elementwise to the right hand side power. **Args:** * ​exp (`Self`): The exponent value. **Returns:** A SIMD vector where each element is raised to the power of the specified exponent value. ### `__lshift__` `__lshift__(self, rhs: Self) -> Self` Returns `self << rhs`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self << rhs`. ### `__rshift__` `__rshift__(self, rhs: Self) -> Self` Returns `self >> rhs`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self >> rhs`. ### `__and__` `__and__(self, rhs: Self) -> Self` Returns `self & rhs`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self & rhs`. ### `__or__` `__or__(self, rhs: Self) -> Self` Returns `self | rhs`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self | rhs`. ### `__xor__` `__xor__(self, rhs: Self) -> Self` Returns `self ^ rhs`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. **Returns:** `self ^ rhs`. ### `__radd__` `__radd__(self, value: Self) -> Self` Returns `value + self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value + self`. ### `__rsub__` `__rsub__(self, value: Self) -> Self` Returns `value - self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value - self`. ### `__rmul__` `__rmul__(self, value: Self) -> Self` Returns `value * self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value * self`. ### `__rtruediv__` `__rtruediv__(self, value: Self) -> Self` Returns `value / self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value / self`. ### `__rfloordiv__` `__rfloordiv__(self, rhs: Self) -> Self` Returns the division of rhs and self rounded down to the nearest integer. **Constraints:** The element type of the SIMD vector must be numeric. **Args:** * ​rhs (`Self`): The value to divide by self. **Returns:** `floor(rhs / self)` value. ### `__rmod__` `__rmod__(self, value: Self) -> Self` Returns `value mod self`. **Args:** * ​value (`Self`): The other value. **Returns:** `value mod self`. ### `__rpow__` `__rpow__(self, base: Self) -> Self` Returns `base ** self`. **Args:** * ​base (`Self`): The base value. **Returns:** `base ** self`. ### `__rlshift__` `__rlshift__(self, value: Self) -> Self` Returns `value << self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value << self`. ### `__rrshift__` `__rrshift__(self, value: Self) -> Self` Returns `value >> self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value >> self`. ### `__rand__` `__rand__(self, value: Self) -> Self` Returns `value & self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value & self`. ### `__ror__` `__ror__(self, value: Self) -> Self` Returns `value | self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value | self`. ### `__rxor__` `__rxor__(self, value: Self) -> Self` Returns `value ^ self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​value (`Self`): The other value. **Returns:** `value ^ self`.
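As a quick illustration of the elementwise and reflected operators above, here is a minimal sketch; the commented results follow directly from the elementwise definitions:

```mojo
fn main():
    var a = SIMD[DType.int32, 4](1, 2, 3, 4)
    var b = SIMD[DType.int32, 4](10, 20, 30, 40)
    print(a + b)    # [11, 22, 33, 44], elementwise __add__
    print(b // a)   # [10, 10, 10, 10], elementwise __floordiv__
    print(a ** 2)   # [1, 4, 9, 16], __pow__ with an Int exponent
    print(a << SIMD[DType.int32, 4](2, 2, 2, 2))  # [4, 8, 12, 16], integral only
    print(100 - a)  # [99, 98, 97, 96], the Int splats to a vector before subtracting
```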
### `__iadd__` `__iadd__(mut self, rhs: Self)` Performs in-place addition. The vector is mutated where each element at position `i` is computed as `self[i] + rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the addition operation. ### `__isub__` `__isub__(mut self, rhs: Self)` Performs in-place subtraction. The vector is mutated where each element at position `i` is computed as `self[i] - rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__imul__` `__imul__(mut self, rhs: Self)` Performs in-place multiplication. The vector is mutated where each element at position `i` is computed as `self[i] * rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__itruediv__` `__itruediv__(mut self, rhs: Self)` In-place true divide operator. The vector is mutated where each element at position `i` is computed as `self[i] / rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__ifloordiv__` `__ifloordiv__(mut self, rhs: Self)` In-place floor div operator. The vector is mutated where each element at position `i` is computed as `self[i] // rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__imod__` `__imod__(mut self, rhs: Self)` In-place mod operator. The vector is mutated where each element at position `i` is computed as `self[i] % rhs[i]`. **Args:** * ​rhs (`Self`): The rhs of the operation. ### `__ipow__` `__ipow__(mut self, rhs: Int)` In-place pow operator. The vector is mutated where each element at position `i` is computed as `pow(self[i], rhs)`. **Args:** * ​rhs (`Int`): The rhs of the operation. ### `__ilshift__` `__ilshift__(mut self, rhs: Self)` Computes `self << rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__irshift__` `__irshift__(mut self, rhs: Self)` Computes `self >> rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__iand__` `__iand__(mut self, rhs: Self)` Computes `self & rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__ixor__` `__ixor__(mut self, rhs: Self)` Computes `self ^ rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `__ior__` `__ior__(mut self, rhs: Self)` Computes `self | rhs` and saves the result in `self`. **Constraints:** The element type of the SIMD vector must be bool or integral. **Args:** * ​rhs (`Self`): The RHS value. ### `get_type_name` `static get_type_name() -> String` Gets this type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `get_device_type_name` `static get_device_type_name() -> String` Gets device\_type's name, for use in error messages when handing arguments to kernels. TODO: This will go away soon, when we get better error messages for kernel calls. **Returns:** This type's name. ### `copy` `copy(self) -> Self` Explicitly construct a copy of self. **Returns:** A copy of this value. ### `from_bits` `static from_bits[int_dtype: DType, //](value: SIMD[int_dtype, size]) -> Self` Initializes the SIMD vector from the bits of an integral SIMD vector. **Parameters:** * ​int\_dtype (`DType`): The integral type of the input SIMD vector. **Args:** * ​value (`SIMD[int_dtype, size]`): The SIMD vector to copy the bits from. **Returns:** The bitcast SIMD vector.
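A short sketch of the in-place operators and `from_bits`; the only assumption beyond the signatures above is that `0x3F800000` is the IEEE 754 bit pattern of `1.0` as a `float32`, which is standard:

```mojo
fn main():
    var v = SIMD[DType.int32, 4](1, 2, 3, 4)
    v += 10   # __iadd__: the Int splats, each lane gains 10
    v *= 2    # __imul__: elementwise doubling
    print(v)  # [22, 24, 26, 28]
    # from_bits reinterprets the integer bits as the target dtype:
    print(Float32.from_bits(Int32(0x3F800000)))  # 1.0
```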
### `to_python_object` `to_python_object(self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__len__` `__len__(self) -> Int` Gets the length of the SIMD vector. **Returns:** The length of the SIMD vector. ### `__int__` `__int__(self) -> Int` Casts the value to an Int. If there is a fractional component, then the fractional part is truncated. **Constraints:** The size of the SIMD vector must be 1. **Returns:** The value as an integer. ### `__index__` `__index__(self) -> index` Convert to index. **Returns:** The corresponding \_\_mlir\_type.index value. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Casts the value to a float. **Constraints:** The size of the SIMD vector must be 1. **Returns:** The value as a float. ### `__str__` `__str__(self) -> String` Get the SIMD as a string. **Returns:** A string representation. ### `__repr__` `__repr__(self) -> String` Get the representation of the SIMD value, e.g. "SIMD\[DType.int8, 2]\(1, 2)". **Returns:** The representation of the SIMD value. ### `__floor__` `__floor__(self) -> Self` Performs elementwise floor on the elements of a SIMD vector. **Returns:** The elementwise floor of this SIMD vector. ### `__ceil__` `__ceil__(self) -> Self` Performs elementwise ceiling on the elements of a SIMD vector. **Returns:** The elementwise ceiling of this SIMD vector. ### `__trunc__` `__trunc__(self) -> Self` Performs elementwise truncation on the elements of a SIMD vector. **Returns:** The elementwise truncated values of this SIMD vector. ### `__abs__` `__abs__(self) -> Self` Defines the absolute value operation. **Returns:** The absolute value of this SIMD vector. ### `__round__` `__round__(self) -> Self` Performs elementwise rounding on the elements of a SIMD vector. This rounding goes to the nearest integer with ties away from zero. **Returns:** The elementwise rounded value of this SIMD vector. `__round__(self, ndigits: Int) -> Self` Performs elementwise rounding on the elements of a SIMD vector. This rounding goes to the nearest integer with ties away from zero. **Args:** * ​ndigits (`Int`): The number of digits to round to. **Returns:** The elementwise rounded value of this SIMD vector. ### `__hash__` `__hash__(self) -> UInt` Hash the value using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with this SIMD value. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance. ### `__ceildiv__` `__ceildiv__(self, denominator: Self) -> Self` Return the rounded-up result of dividing self by denominator. **Args:** * ​denominator (`Self`): The denominator. **Returns:** The ceiling of dividing numerator by denominator. ### `cast` `cast[target: DType](self) -> SIMD[target, size]` Casts the elements of the SIMD vector to the target element type.
Casting behavior: ```mojo # Basic casting preserves value within range Int8(UInt8(127)) == Int8(127) # Numbers above signed max wrap to negative using two's complement Int8(UInt8(128)) == Int8(-128) Int8(UInt8(129)) == Int8(-127) Int8(UInt8(256)) == Int8(0) # Negative signed cast to unsigned using two's complement UInt8(Int8(-128)) == UInt8(128) UInt8(Int8(-127)) == UInt8(129) UInt8(Int8(-1)) == UInt8(255) # Truncate precision after downcast and upcast Float64(Float32(Float64(123456789.123456789))) == Float64(123456792.0) # Rightmost bits of significand become 0's on upcast Float64(Float32(0.3)) == Float64(0.30000001192092896) # Numbers equal after truncation of float literal and cast truncation Float32(Float64(123456789.123456789)) == Float32(123456789.123456789) # Float to int/uint floors Int64(Float64(42.2)) == Int64(42) ``` **Parameters:** * ​target (`DType`): The target DType. **Returns:** A new SIMD vector whose elements have been cast to the target element type. ### `is_power_of_two` `is_power_of_two(self) -> SIMD[bool, size]` Checks if the input value is a power of 2 for each element of a SIMD vector. **Constraints:** The element type of the input vector must be integral. **Returns:** A SIMD value where the element at position `i` is True if the integer at position `i` of the input value is a power of 2, False otherwise. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this SIMD value to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. **Args:** * ​writer (`W`): The object to write to. ### `to_bits` `to_bits[int_dtype: DType = _integral_type_of[::DType]()](self) -> SIMD[int_dtype, size]` Bitcasts the SIMD vector to an integer SIMD vector. **Parameters:** * ​int\_dtype (`DType`): The integer type to cast to. **Returns:** An integer representation of the floating-point value. ### `from_bytes` `static from_bytes[big_endian: Bool = is_big_endian[__mlir_type.!kgen.target]()](bytes: InlineArray[SIMD[uint8, 1], dtype.sizeof()]) -> SIMD[dtype, 1]` Converts a byte array to a scalar integer. **Parameters:** * ​big\_endian (`Bool`): Whether the byte array is big-endian. **Args:** * ​bytes (`InlineArray[SIMD[uint8, 1], dtype.sizeof()]`): The byte array to convert. **Returns:** The integer value. ### `as_bytes` `as_bytes[big_endian: Bool = is_big_endian[__mlir_type.!kgen.target]()](self) -> InlineArray[SIMD[uint8, 1], dtype.sizeof()]` Convert the scalar integer to a byte array. **Parameters:** * ​big\_endian (`Bool`): Whether the byte array should be big-endian. **Returns:** The byte array. ### `clamp` `clamp(self, lower_bound: Self, upper_bound: Self) -> Self` Clamps the values in a SIMD vector to be in a certain range. Clamp cuts values in the input SIMD vector off at the upper bound and lower bound values. For example, SIMD vector `[0, 1, 2, 3]` clamped to a lower bound of 1 and an upper bound of 2 would return `[1, 1, 2, 2]`. **Args:** * ​lower\_bound (`Self`): Minimum of the range to clamp to. * ​upper\_bound (`Self`): Maximum of the range to clamp to. **Returns:** A new SIMD vector containing x clamped to be within lower\_bound and upper\_bound. ### `fma` `fma(self, multiplier: Self, accumulator: Self) -> Self` Performs a fused multiply-add operation, i.e. `self*multiplier + accumulator`. **Args:** * ​multiplier (`Self`): The value to multiply. * ​accumulator (`Self`): The value to accumulate. **Returns:** A new vector whose element at position `i` is computed as `self[i]*multiplier[i] + accumulator[i]`.
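To ground `clamp` and `fma`, a minimal sketch; the scalar arguments splat to vectors via the implicit constructors documented earlier:

```mojo
fn main():
    var x = SIMD[DType.float32, 4](1, 2, 3, 4)
    # Each lane is cut off at the lower and upper bounds:
    print(x.clamp(2, 3))  # [2.0, 2.0, 3.0, 3.0]
    # Fused multiply-add: x[i] * 2 + 0.5 in one operation per lane:
    print(x.fma(2, 0.5))  # [2.5, 4.5, 6.5, 8.5]
```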
### `shuffle` `shuffle[*mask: Int](self) -> Self` Shuffles (also called blend) the values of the current vector using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * \*mask (`Int`): The permutation to use in the shuffle. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self)[permutation[i]]`. `shuffle[*mask: Int](self, other: Self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * \*mask (`Int`): The permutation to use in the shuffle. **Args:** * other (`Self`): The other vector to shuffle with. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self + other)[permutation[i]]`. `shuffle[: DType, //, mask: IndexList[size, element_type=$0]](self) -> Self` Shuffles (also called blend) the values of the current vector using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * mask (`IndexList[size, element_type=$0]`): The permutation to use in the shuffle. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self)[permutation[i]]`. `shuffle[: DType, //, mask: IndexList[size, element_type=$0]](self, other: Self) -> Self` Shuffles (also called blend) the values of the current vector with the `other` value using the specified mask (permutation). The mask values must be within `2 * len(self)`. **Parameters:** * mask (`IndexList[size, element_type=$0]`): The permutation to use in the shuffle. **Args:** * other (`Self`): The other vector to shuffle with. **Returns:** A new vector with the same length as the mask where the value at position `i` is `(self + other)[permutation[i]]`. ### `slice` `slice[output_width: Int, /, *, offset: Int = 0](self) -> SIMD[dtype, output_width]` Returns a slice of the vector of the specified width with the given offset. **Constraints:** `output_width + offset` must not exceed the size of this SIMD vector. **Parameters:** * output\_width (`Int`): The output SIMD vector size. * offset (`Int`): The given offset for the slice. **Returns:** A new vector whose elements map to `self[offset:offset+output_width]`. ### `insert` `insert[*, offset: Int = 0](self, value: SIMD[dtype, size]) -> Self` Returns a new vector where the elements between `offset` and `offset + input_width` have been replaced with the elements in `value`. **Parameters:** * offset (`Int`): The offset to insert at. **Args:** * value (`SIMD[dtype, size]`): The value to be inserted. **Returns:** A new vector whose elements at `self[offset:offset+input_width]` contain the values of `value`. ### `join` `join(self, other: Self) -> SIMD[dtype, (size * 2)]` Concatenates the two vectors together. **Args:** * other (`Self`): The other SIMD vector. **Returns:** A new vector `self_0, self_1, ..., self_n, other_0, ..., other_n`. ### `interleave` `interleave(self, other: Self) -> SIMD[dtype, (size * 2)]` Constructs a vector by interleaving two input vectors. **Args:** * other (`Self`): The other SIMD vector. **Returns:** A new vector `self_0, other_0, ..., self_n, other_n`.
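The width-changing methods are easiest to see by example. A hedged sketch of `join`, `interleave`, and `slice` follows; the output comments show what the definitions above imply, not captured output:

```mojo
def main():
    var a = SIMD[DType.uint8, 4](1, 2, 3, 4)
    var b = SIMD[DType.uint8, 4](5, 6, 7, 8)
    # join: concatenate into a vector of twice the width.
    print(a.join(b))        # [1, 2, 3, 4, 5, 6, 7, 8]
    # interleave: alternate lanes from each input.
    print(a.interleave(b))  # [1, 5, 2, 6, 3, 7, 4, 8]
    # slice: take a 2-lane window starting at lane 1.
    print(a.slice[2, offset=1]())  # [2, 3]
```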
### `split` `split(self) -> Tuple[SIMD[dtype, size // 2], SIMD[dtype, size // 2]]` Splits the SIMD vector into 2 subvectors. **Returns:** Two new vectors: `self_0:N/2` and `self_N/2:N`. ### `deinterleave` `deinterleave(self) -> Tuple[SIMD[dtype, size // 2], SIMD[dtype, size // 2]]` Constructs two vectors by deinterleaving the even and odd lanes of the vector. **Constraints:** The vector size must be greater than 1. **Returns:** Two vectors the first of the form `self_0, self_2, ..., self_{n-2}` and the other being `self_1, self_3, ..., self_{n-1}`. ### `reduce` `reduce[func: fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1], size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using a provided reduce operator. **Constraints:** `size_out` must not exceed width of the vector. **Parameters:** * func (`fn[DType, Int](SIMD[$0, $1], SIMD[$0, $1]) capturing -> SIMD[$0, $1]`): The reduce function to apply to elements in this SIMD. * size\_out (`Int`): The width of the reduction. **Returns:** A new scalar which is the reduction of all vector elements. ### `reduce_max` `reduce_max[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `max` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * size\_out (`Int`): The width of the reduction. **Returns:** The maximum element of the vector. ### `reduce_min` `reduce_min[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `min` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * size\_out (`Int`): The width of the reduction. **Returns:** The minimum element of the vector. ### `reduce_add` `reduce_add[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `add` operator. **Constraints:** `size_out` must not exceed width of the vector. **Parameters:** * size\_out (`Int`): The width of the reduction. **Returns:** The sum of all vector elements. ### `reduce_mul` `reduce_mul[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the `mul` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or FP. **Parameters:** * size\_out (`Int`): The width of the reduction. **Returns:** The product of all vector elements. ### `reduce_and` `reduce_and[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the bitwise `&` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or boolean. **Parameters:** * size\_out (`Int`): The width of the reduction. **Returns:** The reduced vector. ### `reduce_or` `reduce_or[size_out: Int = 1](self) -> SIMD[dtype, size_out]` Reduces the vector using the bitwise `|` operator. **Constraints:** `size_out` must not exceed width of the vector. The element type of the vector must be integer or boolean. **Parameters:** * size\_out (`Int`): The width of the reduction. **Returns:** The reduced vector.
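A short illustrative sketch of the reduction methods above (not from the reference); note how `size_out > 1` stops the tree reduction early:

```mojo
def main():
    var v = SIMD[DType.int32, 4](1, 2, 3, 4)
    print(v.reduce_add())  # 10
    print(v.reduce_max())  # 4
    # With size_out = 2 the two halves [1, 2] and [3, 4]
    # are combined element-wise instead of fully reduced.
    print(v.reduce_add[2]())  # [4, 6]
```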
### `reduce_bit_count` `reduce_bit_count(self) -> Int` Returns the total number of bits set in the SIMD vector. **Constraints:** Must be either an integral or a boolean type. **Returns:** Count of set bits across all elements of the vector. ### `select` `select[result_dtype: DType](self, true_case: SIMD[result_dtype, size], false_case: SIMD[result_dtype, size]) -> SIMD[result_dtype, size]` Selects the values of the `true_case` or the `false_case` based on the current boolean values of the SIMD vector. **Constraints:** The element type of the vector must be boolean. **Parameters:** * result\_dtype (`DType`): The element type of the input and output SIMD vectors. **Args:** * true\_case (`SIMD[result_dtype, size]`): The values selected if the positional value is True. * false\_case (`SIMD[result_dtype, size]`): The values selected if the positional value is False. **Returns:** A new vector of the form `[true_case[i] if elem else false_case[i] for i, elem in enumerate(self)]`. ### `rotate_left` `rotate_left[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the left by `shift` elements (with wrap-around). **Constraints:** `-size <= shift < size` **Parameters:** * shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the left (with wrap-around). **Returns:** The SIMD vector rotated to the left by `shift` elements (with wrap-around). ### `rotate_right` `rotate_right[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the right by `shift` elements (with wrap-around). **Constraints:** `-size < shift <= size` **Parameters:** * shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the right (with wrap-around). **Returns:** The SIMD vector rotated to the right by `shift` elements (with wrap-around). ### `shift_left` `shift_left[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the left by `shift` elements (no wrap-around, fill with zero). **Constraints:** `0 <= shift <= size` **Parameters:** * shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the left (no wrap-around, fill with zero). **Returns:** The SIMD vector rotated to the left by `shift` elements (no wrap-around, fill with zero). ### `shift_right` `shift_right[shift: Int](self) -> Self` Shifts the elements of a SIMD vector to the right by `shift` elements (no wrap-around, fill with zero). **Constraints:** `0 <= shift <= size` **Parameters:** * shift (`Int`): The number of positions by which to rotate the elements of SIMD vector to the right (no wrap-around, fill with zero). **Returns:** The SIMD vector rotated to the right by `shift` elements (no wrap-around, fill with zero). ### `reversed` `reversed(self) -> Self` Reverses the order of the elements in the SIMD vector. Examples:

```mojo
print(SIMD[DType.uint8, 4](1, 2, 3, 4).reversed()) # [4, 3, 2, 1]
```

. **Returns:** The index-reversed vector. --- ## simdbitwidth `simdbitwidth[target: target = _current_target()]() -> Int` Returns the vector size (in bits) of the specified target. **Parameters:** * target (`target`): The target architecture. **Returns:** The vector size (in bits) of the specified target. --- ## simdbytewidth `simdbytewidth[target: target = _current_target()]() -> Int` Returns the vector size (in bytes) of the specified target. **Parameters:** * target (`target`): The target architecture. **Returns:** The vector size (in bytes) of the specified target. --- ## simdwidthof `simdwidthof[type: AnyTrivialRegType, target: target = _current_target()]() -> Int` Returns the vector size of the type on the host system. **Parameters:** * type (`AnyTrivialRegType`): The type in question.
* target (`target`): The target architecture. **Returns:** The vector size of the type on the host system. `simdwidthof[dtype: DType, target: target = _current_target()]() -> Int` Returns the vector size of the type on the host system. **Parameters:** * dtype (`DType`): The DType in question. * target (`target`): The target architecture. **Returns:** The vector size of the dtype on the host system. --- ## sin `sin[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `sin` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. * width (`Int`): The width of the input and output SIMD vector. **Args:** * x (`SIMD[dtype, width]`): The input argument. **Returns:** The `sin` of the input. --- ## sinh `sinh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `sinh` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * dtype (`DType`): The `dtype` of the input and output SIMD vector. * width (`Int`): The width of the input and output SIMD vector. **Args:** * x (`SIMD[dtype, width]`): The input argument. **Returns:** The `sinh` of the input. --- ## size `size(a: IntTuple[origin]) -> Int` Calculate the total size (product of all elements) of an `IntTuple`. This function computes the product of all integer values in the `IntTuple`, regardless of nesting level. **Args:** * a (`IntTuple[origin]`): The `IntTuple` whose elements will be multiplied together. **Returns:** The product of all elements in the `IntTuple`. --- ## size `size(l: Layout) -> Int` Returns the total number of elements in the layout's domain. This is a standalone function equivalent to the Layout.size() method. **Args:** * l (`Layout`): The layout to calculate the size for. **Returns:** The total number of elements in the layout. --- ## Sized The `Sized` trait describes a type that has an integer length (such as a string or array). Any type that conforms to `Sized` or [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. The `Sized` trait requires a type to implement the `__len__()` method. For example:

```mojo
struct Foo(Sized):
    var length: Int

    fn __len__(self) -> Int:
        return self.length
```

You can pass an instance of `Foo` to the `len()` function to get its length:

```mojo
var foo = Foo(42)
print(len(foo) == 42)
```

```plaintext
True
```

**Note:** If the `__len__()` method can raise an error, use the [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) trait instead. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__len__` `__len__(self: _Self) -> Int` Get the length of the type. **Returns:** The length of the type. --- ## SizedRaising The `SizedRaising` trait describes a type that has an integer length, which might raise an error if the length can't be determined. Any type that conforms to [`Sized`](/mojo/stdlib/builtin/len/Sized) or `SizedRaising` works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. The `SizedRaising` trait requires a type to implement the `__len__()` method, which can raise an error. For example:

```mojo
struct Foo(SizedRaising):
    var length: Int

    fn __len__(self) raises -> Int:
        if self.length < 0:
            raise Error("Length is negative")
        return self.length
```

## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__len__` `__len__(self: _Self) raises -> Int` Get the length of the type. **Returns:** The length of the type. **Raises:** If the length cannot be computed.
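To round out the `SizedRaising` entry, a small usage sketch of how `len()` interacts with a raising `__len__()`; the `Foo` struct repeats the example above so the snippet stands alone, and the error message is illustrative:

```mojo
struct Foo(SizedRaising):
    var length: Int

    fn __len__(self) raises -> Int:
        if self.length < 0:
            raise Error("Length is negative")
        return self.length

def main():
    try:
        print(len(Foo(-1)))  # __len__() raises here
    except e:
        print("error:", e)   # error: Length is negative
```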
--- ## sizeof `sizeof[type: AnyType, target: target = _current_target()]() -> Int` Returns the size (in bytes) of the type. Example:

```mojo
from sys.info import sizeof

def main():
    print(
        sizeof[UInt8]() == 1,
        sizeof[UInt16]() == 2,
        sizeof[Int32]() == 4,
        sizeof[Float64]() == 8,
        sizeof[SIMD[DType.uint8, 4]]() == 4,
    )
```

Note: `align_of` is in the same module. **Parameters:** * type (`AnyType`): The type in question. * target (`target`): The target architecture. **Returns:** The size of the type in bytes. `sizeof[dtype: DType, target: target = _current_target()]() -> Int` Returns the size (in bytes) of the dtype. **Parameters:** * dtype (`DType`): The DType in question. * target (`target`): The target architecture. **Returns:** The size of the dtype in bytes. --- ## sleep `sleep(sec: SIMD[float64, 1])` Suspends the current thread for the specified number of seconds. **Args:** * sec (`SIMD[float64, 1]`): The number of seconds to sleep for. `sleep(sec: UInt)` Suspends the current thread for the specified number of seconds. **Args:** * sec (`UInt`): The number of seconds to sleep for. --- ## slice ## Functions * [​`copy_to_slice`](./copy_to_slice): * [​`slice_as_copy`](./slice_as_copy): * [​`slice_as_view`](./slice_as_view): * [​`slice_dim_as_view`](./slice_dim_as_view): * [​`slice_shape`](./slice_shape): --- ## slice `slice(end: Int) -> Slice` Construct slice given the end value. **Args:** * end (`Int`): The end value. **Returns:** The constructed slice. `slice(start: Int, end: Int) -> Slice` Construct slice given the start and end values. **Args:** * start (`Int`): The start value. * end (`Int`): The end value. **Returns:** The constructed slice. `slice(start: Optional[Int], end: Optional[Int], step: Optional[Int]) -> Slice` Construct a Slice given the start, end and step values. **Args:** * start (`Optional[Int]`): The start value. * end (`Optional[Int]`): The end value. * step (`Optional[Int]`): The step value. **Returns:** The constructed slice. --- ## Slice `struct Slice` Represents a slice expression. Objects of this type are generated when slice syntax is used within square brackets, e.g.:

```mojo
var msg: String = "Hello Mojo"

# Both are equivalent and print "Mojo".
print(msg[6:])
print(msg.__getitem__(Slice(6, len(msg))))
```

## Fields * start (`Optional[Int]`): The starting index of the slice. * end (`Optional[Int]`): The end index of the slice. * step (`Optional[Int]`): The step increment value of the slice. ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__(out self, start: Int, end: Int)` Construct slice given the start and end values. **Args:** * start (`Int`): The start value. * end (`Int`): The end value. `__init__(out self, start: Optional[Int], end: Optional[Int], step: Optional[Int])` Construct slice given the start, end and step values. **Args:** * start (`Optional[Int]`): The start value. * end (`Optional[Int]`): The end value. * step (`Optional[Int]`): The step value. ### `__eq__` `__eq__(self, other: Self) -> Bool` Compare this slice to the other. **Args:** * other (`Self`): The slice to compare to. **Returns:** True if start, end, and step values of this slice match the corresponding values of the other slice and False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compare this slice to the other. **Args:** * other (`Self`): The slice to compare to.
**Returns:** False if start, end, and step values of this slice match the corresponding values of the other slice and True otherwise. ### `copy` `copy(self) -> Self` Creates a deep copy of the Slice. **Returns:** A copy of the value. ### `__str__` `__str__(self) -> String` Gets the string representation of the slice. **Returns:** The string representation of the slice. ### `__repr__` `__repr__(self) -> String` Gets the string representation of the slice. **Returns:** The string representation of the slice. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write Slice string representation to a `Writer`. **Parameters:** * W (`Writer`): A type conforming to the Writable trait. **Args:** * writer (`W`): The object to write to. ### `indices` `indices(self, length: Int) -> Tuple[Int, Int, Int]` Returns a tuple of 3 integers representing the start, end, and step of the slice if applied to a container of the given length. Uses the target container length to normalize negative, out of bounds, or None indices. Negative indices are wrapped using the length of the container.

```mojo
s = slice(0, -1, 1)
i = s.indices(5) # returns (0, 4, 1)
```

None indices are defaulted to the start or the end of the container based on whether `step` is positive or negative.

```mojo
s = slice(None, None, 1)
i = s.indices(5) # returns (0, 5, 1)
```

Out of bounds indices are clamped using the size of the container.

```mojo
s = slice(20)
i = s.indices(5) # returns (0, 5, 1)
```

**Args:** * length (`Int`): The length of the target container. **Returns:** A tuple containing three integers for start, end, and step. --- ## slice_as_copy `slice_as_copy[type: DType, index_type: DType, in_rank: Int](output: NDBuffer[type, in_rank, origin], tensor: NDBuffer[type, in_rank, origin], start: NDBuffer[index_type, 1, origin], end: NDBuffer[index_type, 1, origin], step: NDBuffer[index_type, 1, origin])` --- ## slice_as_view `slice_as_view[type: DType, start_type: DType, end_type: DType, step_type: DType, rank: Int](tensor: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], starts: NDBuffer[start_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], ends: NDBuffer[end_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], steps: NDBuffer[step_type, 1, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> NDBuffer[type, rank, origin]` --- ## slice_dim_as_view `slice_dim_as_view[type: DType, rank: Int, dim: Int](tensor: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], start: Int, end: Int, step: Int) -> NDBuffer[type, rank, origin]` --- ## slice_shape `slice_shape[input_rank: Int, input_type: DType, start_type: DType, stop_type: DType, step_type: DType, single_thread_blocking_override: Bool](input_buf: NDBuffer[input_type, input_rank, origin], start_buf: NDBuffer[start_type, 1, origin], stop_buf: NDBuffer[stop_type, 1, origin], step_buf: NDBuffer[step_type, 1, origin]) -> IndexList[input_rank]` --- ## SlidingWindowCausalMask `@register_passable(trivial)` `struct SlidingWindowCausalMask[window_size: Int]` Mask implementing Sliding Window attention.
Consider the following case: * Q\_len = 7 * K\_len = 7 * window\_size = 3 The mask will be applied as follows:

```plaintext
K >    0 1 2 3 4 5 6
Q v   x------------x
0 |   1 0 0 0 0 0 0
1 |   1 1 0 0 0 0 0
2 |   1 1 1 0 0 0 0
3 |   0 1 1 1 0 0 0
4 |   0 0 1 1 1 0 0
5 |   0 0 0 1 1 1 0
6 |   0 0 0 0 1 1 1
```

## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAMask`, `Movable`, `UnknownDestructibility` ## Aliases ### `apply_log2e_after_mask` `alias apply_log2e_after_mask = False` ### `mask_out_of_bound` `alias mask_out_of_bound = True` ### `mask_safe_out_of_bounds` `alias mask_safe_out_of_bounds = True` ## Methods ### `mask` `mask[type: DType, width: Int, //, *, element_type: DType = uint32](self, coord: IndexList[4, element_type=element_type], score_vec: SIMD[type, width]) -> SIMD[type, width]` ### `status` `status[*, element_type: DType = uint32](self, tile_offset: IndexList[2, element_type=element_type], tile_size: IndexList[2, element_type=element_type]) -> TileMaskStatus` --- ## sm_id `sm_id() -> UInt` Returns the Streaming Multiprocessor (SM) ID of the current thread. The SM ID uniquely identifies which physical streaming multiprocessor the thread is executing on. This is useful for SM-level optimizations and understanding hardware utilization. If called on non-NVIDIA GPUs, this function aborts as this functionality is only supported on NVIDIA hardware. **Returns:** The SM ID of the current thread. --- ## softmax ## Functions * [​`identity`](./identity): * [​`logsoftmax`](./logsoftmax): Performs an unbatched logsoftmax on an input tensor using the three-pass algorithm. * [​`mul`](./mul): * [​`reciprocal`](./reciprocal): * [​`reduce_add_simd`](./reduce_add_simd): This function adds val to either the scalar value or the vector value depending on the step\_simd\_width. This is useful when the simd\_width varies between iterations as in vectorize. * [​`softmax`](./softmax): * [​`softmax_2_pass`](./softmax_2_pass): Performs an unbatched softmax on an input tensor using the two-pass online algorithm. * [​`softmax_3_pass`](./softmax_3_pass): Performs an unbatched softmax on an input tensor using the three-pass algorithm. * [​`softmax_kernel`](./softmax_kernel): * [​`sub`](./sub): --- ## softmax `softmax[type: DType, simd_width: Int, rank: Int, static_shape: DimList](input: NDBuffer[type, rank, origin, static_shape], output: NDBuffer[type, rank, origin, static_shape], axis: Int)` `softmax[: origin.set, //, type: DType, simd_width: Int, rank: Int, static_shape: DimList, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](shape: IndexList[rank], output: NDBuffer[type, rank, origin, static_shape], axis: Int, context: DeviceContextPtr = DeviceContextPtr())` --- ## softmax_2_pass `softmax_2_pass[simd_width: Int, buffer_size: Dim, type: DType](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)], input: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched softmax on an input tensor using the two-pass online algorithm.
The unbatched two-pass online softmax is described in "Online normalizer calculation for softmax" and "A full-stack search technique for domain optimized deep learning accelerators" and is defined as:

```plaintext
procedure SoftmaxUnbatched(Input)
    runningMax = -∞
    runningSum = 0
    STAGE 1:
    for i = 0 to N do
        newMax = max(runningMax, Input[i])
        runningSum = runningSum*exp(runningMax-newMax) + exp(Input[i]-newMax)
        runningMax = newMax
    end for
    for i = 0 to N do
        Output[i] = exp(Input[i] - runningMax) / runningSum
    end for
```

**Parameters:** * simd\_width (`Int`): The simd\_width to use in vectorization. * buffer\_size (`Dim`): The size of the input and output buffers. * type (`DType`): The type of the input and output buffers. **Args:** * output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values. * input (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The input buffer used to compute the softmax. --- ## softmax_3_pass `softmax_3_pass[simd_width: Int, buffer_size: Dim, type: DType, origins: origin.set, input_fn_1d: fn[Int](Int) capturing -> SIMD[type, $0]](output: NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)])` Performs an unbatched softmax on an input tensor using the three-pass algorithm. The unbatched three-pass softmax is defined as:

```plaintext
procedure SoftmaxUnbatched(Input)
    maxVal = -∞
    denom = 0
    STEP 1: find the max value in each batch
    for i = 0 to N do
        maxVal = max(maxVal, Input[b, i])
    end for
    STEP 2: compute the exponential for each batch
    for i = 0 to N do
        Output[b, i] = exp(Input[b, i] - maxVal)
        denom += Output[b, i]
    end for
    STEP 3: normalize each batch
    for i = 0 to N do
        Output[b, i] /= denom
    end for
```

**Parameters:** * simd\_width (`Int`): The simd\_width to use in vectorization. * buffer\_size (`Dim`): The size of the input and output buffers. * type (`DType`): The type of the input and output buffers. * origins (`origin.set`): The OriginSet of captured arguments by the input\_fn\_1d. * input\_fn\_1d (`fn[Int](Int) capturing -> SIMD[type, $0]`): The elementwise input lambda. **Args:** * output (`NDBuffer[type, 1, origin, __init__[::Intable](buffer_size)]`): The output buffer in which to store the softmax values. --- ## softmax_kernel `softmax_kernel[: origin.set, //, BLOCK_SIZE: Int, input_fn: fn[DType, Int, Int](IndexList[$2]) capturing -> SIMD[$0, $1], type: DType, rank: Int, accum_type: DType = get_accum_type[::DType,::DType]()](shape: IndexList[rank], output: NDBuffer[type, rank, MutableAnyOrigin])` --- ## sort Implements the built-in `sort` function. These are Mojo built-ins, so you don't need to import them. ## Aliases ### `insertion_sort_threshold` `alias insertion_sort_threshold = 32` ## Functions * [​`partition`](/mojo/stdlib/builtin/sort/partition): Partition the input buffer inplace such that first k elements are the largest (or smallest if cmp\_fn is --- ## sort `sort[: origin.set, T: Copyable & Movable, origin: MutableOrigin, //, cmp_fn: fn(T, T) capturing -> Bool, *, stable: Bool = False](span: Span[T, origin])` Sort the list inplace. The function doesn't return anything, the list is updated inplace. **Parameters:** * T (`Copyable & Movable`): Copyable & Movable type of the underlying data. * origin (`MutableOrigin`): Origin of span. * cmp\_fn (`fn(T, T) capturing -> Bool`): The comparison function. * stable (`Bool`): Whether the sort should be stable. **Args:** * span (`Span[T, origin]`): The span to be sorted.
`sort[: origin.set, origin: MutableOrigin, //, cmp_fn: fn(Int, Int) capturing -> Bool, *, stable: Bool = False](span: Span[Int, origin])` Sort the list inplace. The function doesn't return anything, the list is updated inplace. **Parameters:** * origin (`MutableOrigin`): Origin of span. * cmp\_fn (`fn(Int, Int) capturing -> Bool`): The comparison function. * stable (`Bool`): Whether the sort should be stable. **Args:** * span (`Span[Int, origin]`): The span to be sorted. `sort[origin: MutableOrigin, //, *, stable: Bool = False](span: Span[Int, origin])` Sort the list inplace. The function doesn't return anything, the list is updated inplace. **Parameters:** * origin (`MutableOrigin`): Origin of span. * stable (`Bool`): Whether the sort should be stable. **Args:** * span (`Span[Int, origin]`): The span to be sorted. `sort[dtype: DType, origin: MutableOrigin, //, *, stable: Bool = False](span: Span[SIMD[dtype, 1], origin])` Sort the list inplace. The function doesn't return anything, the list is updated inplace. **Parameters:** * dtype (`DType`): The DType of the underlying data. * origin (`MutableOrigin`): Origin of span. * stable (`Bool`): Whether the sort should be stable. **Args:** * span (`Span[SIMD[dtype, 1], origin]`): The span to be sorted. `sort[T: Copyable & Movable & Comparable, origin: MutableOrigin, //, *, stable: Bool = False](span: Span[T, origin])` Sort a list of order-comparable elements in-place. **Parameters:** * T (`Copyable & Movable & Comparable`): The order comparable collection element type. * origin (`MutableOrigin`): Origin of span. * stable (`Bool`): Whether the sort should be stable. **Args:** * span (`Span[T, origin]`): The span to be sorted. --- ## sort_buf_descending `sort_buf_descending[type: DType, out_idx_type: DType, rank: Int, //](mut buf_keys: NDBuffer[type, rank, origin], mut buf_ids: NDBuffer[out_idx_type, rank, origin], vocab_size: Int)` Sort each batch separately in descending order using parallel merge sort. --- ## sorted `sorted[cmp: fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool = __lt__[::Origin[::Bool[?, ?]](tuple: IntTuple[origin]) -> IntTuple` Sort an IntTuple using the provided comparison function. This function implements a merge sort algorithm to efficiently sort the elements of an IntTuple. The sorting is stable and has `O(n log n)` time complexity. **Parameters:** * cmp (`fn[ImmutableOrigin, ImmutableOrigin](IntTuple[$0], IntTuple[$1]) -> Bool`): A comparison function that takes two `IntTuple` elements and returns True if the first should come before the second. Defaults to the `lt` function which performs lexicographical ordering. **Args:** * tuple (`IntTuple[origin]`): The `IntTuple` to be sorted. **Returns:** A new `IntTuple` containing the same elements as the input but sorted according to the comparison function. --- ## span Implements the `Span` type. You can import these APIs from the `memory` module. For example:

```mojo
from memory import Span
```

## Structs * [​`Span`](/mojo/stdlib/memory/span/Span): A non-owning view of contiguous data. --- ## Span `@register_passable(trivial)` `struct Span[mut: Bool, //, T: Copyable & Movable, origin: Origin[mut], *, address_space: AddressSpace = AddressSpace(0), alignment: Int = _default_alignment[::AnyType]()]` A non-owning view of contiguous data. ## Parameters * mut (`Bool`): Whether the span is mutable. * T (`Copyable & Movable`): The type of the elements in the span. * origin (`Origin[mut]`): The origin of the Span.
* address\_space (`AddressSpace`): The address space associated with the allocated memory. * alignment (`Int`): The minimum alignment of the underlying pointer known statically. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility` ## Aliases ### `Immutable` `alias Immutable = Span[T, (muttoimm origin._mlir_origin)]` The immutable version of the `Span`. ### `Mutable` `alias Mutable = Span[T, (mutcast origin._mlir_origin)]` The mutable version of the `Span`. ## Methods ### `__init__` `__init__() -> Self` Create an empty / zero-length span. `__init__(*, ptr: UnsafePointer[T, address_space=address_space, alignment=alignment], length: UInt) -> Self` Unsafe construction from a pointer and length. **Args:** * ptr (`UnsafePointer[T, address_space=address_space, alignment=alignment]`): The underlying pointer of the span. * length (`UInt`): The length of the view. `@implicit` `__init__(ref [origin, address_space] list: List[T, hint_trivial_type]) -> Self` Construct a `Span` from a `List`. **Args:** * list (`List[T, hint_trivial_type]`): The list to which the span refers. `@implicit` `__init__[size: Int, //](ref [origin] array: InlineArray[T, size]) -> Self` Construct a `Span` from an `InlineArray`. **Parameters:** * size (`Int`): The size of the `InlineArray`. **Args:** * array (`InlineArray[T, size]`): The array to which the span refers. ### `__bool__` `__bool__(self) -> Bool` Check if a span is non-empty. **Returns:** True if a span is non-empty, False otherwise. ### `__getitem__` `__getitem__[I: Indexer](self, idx: I) -> ref [origin, address_space] T` Get a reference to an element in the span. **Parameters:** * I (`Indexer`): A type that can be used as an index. **Args:** * idx (`I`): The index of the value to return. **Returns:** An element reference. `__getitem__(self, slc: Slice) -> Self` Get a new span from a slice of the current span. Allocation: This function allocates when the step is negative; to avoid a memory leak, take ownership of the returned value. **Args:** * slc (`Slice`): The slice specifying the range of the new subslice. **Returns:** A new span that points to the same data as the current span. ### `__eq__` `__eq__[T: EqualityComparable & Copyable & Movable, rhs_alignment: Int, //](self: Span[T, origin, alignment=alignment], rhs: Span[T, origin, alignment=rhs_alignment]) -> Bool` Verify if span is equal to another span. **Parameters:** * T (`EqualityComparable & Copyable & Movable`): The type of the elements in the span. Must implement the traits `EqualityComparable`, `Copyable` and `Movable`. * rhs\_alignment (`Int`): The inferred alignment of the rhs span. **Args:** * rhs (`Span[T, origin, alignment=rhs_alignment]`): The span to compare against. **Returns:** True if the spans are equal in length and contain the same elements, False otherwise. ### `__ne__` `__ne__[T: EqualityComparable & Copyable & Movable, //](self: Span[T, origin, alignment=alignment], rhs: Span[T, origin]) -> Bool` Verify if span is not equal to another span. **Parameters:** * T (`EqualityComparable & Copyable & Movable`): The type of the elements in the span. Must implement the traits `EqualityComparable`, `Copyable` and `Movable`. **Args:** * rhs (`Span[T, origin]`): The span to compare against. **Returns:** True if the spans are not equal in length or contents, False otherwise.
### `__contains__` `__contains__[dtype: DType, //](self: Span[SIMD[dtype, 1], origin, address_space=address_space, alignment=alignment], value: SIMD[dtype, 1]) -> Bool` Verify if a given value is present in the Span. **Parameters:** * dtype (`DType`): The DType of the scalars stored in the Span. **Args:** * value (`SIMD[dtype, 1]`): The value to find. **Returns:** True if the value is contained in the span, False otherwise. ### `copy` `copy(self) -> Self` Explicitly construct a copy of the provided `Span`. **Returns:** A copy of the `Span`. ### `__iter__` `__iter__(self) -> _SpanIter[T, origin, address_space=address_space, alignment=alignment]` Get an iterator over the elements of the `Span`. **Returns:** An iterator over the elements of the `Span`. ### `__reversed__` `__reversed__(self) -> _SpanIter[T, origin, False, address_space, alignment]` Iterate backwards over the `Span`. **Returns:** A reversed iterator of the `Span` elements. ### `__len__` `__len__(self) -> Int` Returns the length of the span. This is a known constant value. **Returns:** The size of the span. ### `get_immutable` `get_immutable(self) -> Span[T, (muttoimm origin._mlir_origin)]` Return an immutable version of this `Span`. **Returns:** An immutable version of the same `Span`. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Retrieves a pointer to the underlying memory. **Returns:** The pointer to the underlying memory. ### `as_ref` `as_ref(self) -> Pointer[T, origin, address_space]` Gets a `Pointer` to the first element of this span. **Returns:** A `Pointer` pointing at the first element of this span. ### `copy_from` `copy_from[origin: MutableOrigin, other_alignment: Int, //](self: Span[T, origin, alignment=alignment], other: Span[T, origin, alignment=other_alignment])` Performs an element wise copy from all elements of `other` into all elements of `self`. **Parameters:** * origin (`MutableOrigin`): The inferred mutable origin of the data within the Span. * other\_alignment (`Int`): The inferred alignment of the data within the Span. **Args:** * other (`Span[T, origin, alignment=other_alignment]`): The `Span` to copy all elements from. ### `fill` `fill[origin: MutableOrigin, //](self: Span[T, origin, alignment=alignment], value: T)` Fill the memory that a span references with a given value. **Parameters:** * origin (`MutableOrigin`): The inferred mutable origin of the data within the Span. **Args:** * value (`T`): The value to assign to each element. ### `swap_elements` `swap_elements(self: Span[T, origin, alignment=alignment], a: UInt, b: UInt)` Swap the values at indices `a` and `b`. **Args:** * a (`UInt`): The first argument index. * b (`UInt`): The second argument index. **Raises:** If `a` or `b` is larger than the length of the span. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[Span[T, $1, address_space=address_space, alignment=alignment]]](self) -> Span[T, origin, address_space=address_space, alignment=alignment]` Returns a pointer merged with the specified `other_type`. **Parameters:** * other\_type (`AnyStruct[Span[T, $1, address_space=address_space, alignment=alignment]]`): The type of the pointer to merge with. **Returns:** A pointer merged with the specified `other_type`.
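Putting the `Span` pieces together, a hedged sketch that assumes origin and parameter inference let a span be constructed directly from a mutable `List` via the implicit constructor above:

```mojo
def main():
    var numbers = List[Int](1, 2, 3, 4)
    # Non-owning view: the span borrows the list's memory.
    var view = Span(numbers)
    print(len(view), view[0])  # 4 1
    # Writes through a mutable span land in the underlying list.
    view.fill(7)
    print(numbers[2])  # 7
```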
--- ## Speculative decoding Speculative decoding is an algorithm designed to accelerate the decoding process for large language models without sacrificing the quality of the generated text or requiring modifications to the models themselves. This technique employs a smaller, faster **draft model** to generate several potential next tokens in parallel, which are then efficiently validated against a larger, more powerful target model using a modified rejection sampling technique. This leads to reduced overall latency and improved throughput during token generation. By accepting correct predictions and only resampling when necessary, speculative decoding achieves a significant speedup in token generation, effectively bypassing memory bandwidth limitations often encountered during standard autoregressive decoding. :::caution Speculative decoding with MAX is still in preview and some aspects may change as we refine the implementation. Expect **ongoing** improvements and potential adjustments based on feedback and performance optimizations. ::: ## When to use speculative decoding You'll want to use speculative decoding when your primary goal is to accelerate the decoding process of large language models and reduce latency. For example, if you are using a 405 billion parameter model, you can use speculative decoding to reduce latency by using a 135 million parameter draft model. ## How speculative decoding works By default, speculative decoding is disabled in MAX. It can be enabled using the `--draft-model-path` flag, which specifies the model used to generate speculative tokens: either a model name as it appears on Hugging Face or a path to a local directory containing a model. All model-specific parameters can be prefixed with `--draft-` to configure the draft model independently from the main model. For example: - `--draft-model-path`: Path to the draft model - `--draft-quantization-encoding`: Quantization encoding for the draft model - `--draft-weight-path`: Path to draft model weights The performance of speculative decoding primarily depends on two factors: - **Acceptance rate**: How often the target model confirms the draft model's predictions. - **Token generation pattern**: The system is optimized when more draft tokens can be evaluated in a single step of the target model. This is controlled by the `--max-num-steps` parameter, which sets the maximum number of tokens the draft model generates before verification by the target model. ## Quickstart You can use speculative decoding with MAX to accelerate model inference by using a smaller draft model to predict tokens that are verified by the main model. Serve your model with MAX and specify the draft model path using the `--draft-model-path` flag:

```sh
max serve --model-path HuggingFaceTB/SmolLM2-360M-Instruct \
  --draft-model-path HuggingFaceTB/SmolLM2-135M-Instruct \
  --device-memory-utilization=0.6 \
  --max-num-steps=5 \
  --no-enable-chunked-prefill
```

The endpoint is ready when you see the URI printed in your terminal:

```output
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

Once the model is served, you can make requests to the API endpoints.
Install the `openai` package:

```sh
pip install openai
```

Then create a new Python file and import the `openai` package:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # Your MAX endpoint
    api_key="not-needed",  # API key can be any string when using MAX locally
)

# Make a chat completion request
response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the benefits of speculative decoding?"}
    ],
    max_tokens=500
)

# Print the response
print(response.choices[0].message.content)
```

In a new terminal, make a chat completion request using curl:

```sh
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HuggingFaceTB/SmolLM2-360M-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What are the benefits of speculative decoding?"}
    ],
    "max_tokens": 500
  }'
```

You can also use the `generate` command to generate text:

```sh
max generate --model-path HuggingFaceTB/SmolLM2-360M-Instruct \
  --draft-model-path HuggingFaceTB/SmolLM-135M \
  --max-length=200 \
  --prompt="What are the benefits of speculative decoding?" \
  --device-memory-utilization=0.6 \
  --devices=gpu \
  --no-enable-chunked-prefill
```

## Next steps Now that you know the basics of speculative decoding, you can get started with MAX on GPUs. --- ## SpinWaiter `struct SpinWaiter` A proxy for the C++ runtime's SpinWaiter type. ## Fields * storage (`UnsafePointer[NoneType]`): Pointer to the underlying SpinWaiter instance. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes a SpinWaiter instance. ### `__del__` `__del__(owned self)` Destroys the SpinWaiter instance. ### `wait` `wait(self)` Blocks the current task for a duration determined by the underlying policy. --- ## split ## Functions * [​`split`](./split): --- ## split `split[type: DType, rank: Int, num_outputs: Int, target: StringSlice[StaticConstantOrigin], trace_description: StringSlice[StaticConstantOrigin]](input: NDBuffer[type, rank, origin], axis: Int, outputs: StaticTuple[NDBuffer[type, rank, MutableAnyOrigin], num_outputs], ctx: DeviceContext)` --- ## split `split[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String]` Split a given pathname into two components: head and tail. This is useful for separating the directory path from the filename. If the input path ends with a separator, the tail component will be empty. If there is no separator in the path, the head component will be empty, and the entire path will be considered the tail. Trailing separators in the head are stripped unless the head is the root directory. **Parameters:** * PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * path (`PathLike`): The path to be split. **Returns:** A tuple containing two strings: (head, tail). --- ## split_extension `split_extension[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String]` Splits `path` into the root and extension. **Parameters:** * PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * path (`PathLike`): The path to be split. **Returns:** A tuple containing two strings: (root, extension).
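A quick sketch of the two path helpers documented above, using illustrative paths; the tuples are indexed rather than unpacked to stay conservative about destructuring support:

```mojo
from os.path import split, split_extension

def main():
    # Head/tail split on the last path separator.
    var parts = split("/usr/local/bin/mojo")
    print(parts[0])  # /usr/local/bin
    print(parts[1])  # mojo

    # Root/extension split on the last dot.
    var name = split_extension("report.tar.gz")
    print(name[0])   # report.tar
    print(name[1])   # .gz
```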
--- ## split_k_reduce `split_k_reduce[c_type: DType, work_space_type: DType, c_shape: DimList, work_space_shape: DimList, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1})](c: NDBuffer[c_type, 2, origin, c_shape], work_space: NDBuffer[work_space_type, 3, origin, work_space_shape], ctx: DeviceContext)` --- ## SplitKPartition `@register_passable(trivial)` `struct SplitKPartition[dtype: DType]` ## Fields * ​ptr (`UnsafePointer[SIMD[dtype, 1]]`): * ​num\_partitions\_value (`SIMD[uint32, 1]`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHAPartitionScheme`, `Movable`, `UnknownDestructibility` ## Aliases ### `accum_dtype` `alias accum_dtype = dtype` ### `do_partition` `alias do_partition = True` ## Methods ### `__init__` `__init__(ptr: UnsafePointer[SIMD[dtype, 1]], num_partitions_value: SIMD[uint32, 1]) -> Self` ### `num_partitions` `num_partitions(self) -> SIMD[uint32, 1]` ### `get_exp_sum_qk_max_pointer` `get_exp_sum_qk_max_pointer(self) -> UnsafePointer[SIMD[dtype, 1]]` --- ## splitroot `splitroot[PathLike: PathLike, //](path: PathLike) -> Tuple[String, String, String]` Splits `path` into drive, root and tail. The tail contains anything after the root. **Parameters:** * ​PathLike (`PathLike`): The type conforming to the os.PathLike trait. **Args:** * ​path (`PathLike`): The path to be split. **Returns:** A tuple containing three strings: (drive, root, tail). --- ## sqrt `sqrt(x: Int) -> Int` Performs square root on an integer. **Args:** * ​x (`Int`): The integer value to perform square root on. **Returns:** The square root of x. `sqrt[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise square root on the elements of a SIMD vector. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): SIMD vector to perform square root on. **Returns:** The elementwise square root of x. --- ## st_matrix `st_matrix[dtype: DType, //, simd_width: Int, *, transpose: Bool = False](ptr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)], d: SIMD[float32, simd_width])` Performs warp-synchronized copy from registers to shared memory. This function stores data from registers to shared memory in a format that can be directly used by tensor core Matrix Multiply-Accumulate (MMA) instructions. It uses the NVIDIA stmatrix instruction to perform an efficient warp-synchronized store. Note: The function performs a warp-synchronized operation - all threads in the warp must execute this instruction to avoid deadlock. **Constraints:** * Must be used with shared memory pointers. * Number of registers must be 1, 2, or 4. * Data must be properly aligned for matrix operations. * All threads in warp must participate. * Only supported on NVIDIA GPUs with tensor core capabilities. **Parameters:** * ​dtype (`DType`): Data type of elements to store. * ​simd\_width (`Int`): Width of the SIMD vector. * ​transpose (`Bool`): If True, transposes the matrix during store. **Args:** * ​ptr (`UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]`): Pointer to shared memory where data will be stored. * ​d (`SIMD[float32, simd_width]`): SIMD vector containing the data to store. 
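Looping back to the `sqrt` entry above, a minimal sketch of the integer and elementwise SIMD overloads; it assumes `sqrt` is importable from the `math` package:

```mojo
from math import sqrt

def main():
    # Integer square root.
    print(sqrt(16))  # 4
    # Elementwise square root across SIMD lanes.
    var v = SIMD[DType.float32, 4](1.0, 4.0, 9.0, 16.0)
    print(sqrt(v))   # [1.0, 2.0, 3.0, 4.0]
```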
--- ## st_matrix_n_atom `st_matrix_n_atom[num_stmatrix: Int]() -> Layout` Creates a layout for N-major `st_matrix` atom in the context of WGMMA C matrix. The domain of this layout is the warp group local thread index. Thus, the layout takes \[0, 128) as input and returns an offset for a logical array with an element size of 128-bit. **Parameters:** * ​num\_stmatrix (`Int`): Number of N-dimension tiles in the C matrix. **Returns:** `Layout` - A layout that maps warp group local thread index to an offset for a logical array with an element size of 128-bit. --- ## st_matrix_n_layout `st_matrix_n_layout[c_type: DType, WG_BN: Int, num_m_mmas: Int, num_consumer: Int]() -> Layout` Creates a layout for N-major `st_matrix` in the context of WGMMA C matrix. The layout modes are: the warp group local thread index, the N-dimension tiling size `WG_BN // 16`, the number of MMA tiles `num_m_mmas` in the M-dimension, and the number of consumers `num_consumer`. The output is an offset for a logical array with the element type `c_type`. **Parameters:** * ​c\_type (`DType`): Data type of the C matrix. * ​WG\_BN (`Int`): Size of the K dimension in the C matrix in shared memory. * ​num\_m\_mmas (`Int`): Number of MMA tiles in the M dimension. * ​num\_consumer (`Int`): Number of consumers. **Returns:** `Layout` - A layout that maps warp group local thread index to an offset for a logical array with the element type `c_type`. --- ## stack_allocation `stack_allocation[count: Int, dtype: DType, /, alignment: Int = alignof[::DType,__mlir_type.!kgen.target]() if is_gpu() else 1, address_space: AddressSpace = AddressSpace(0)]() -> UnsafePointer[SIMD[dtype, 1], address_space=address_space]` Allocates data buffer space on the stack given a data type and number of elements. **Parameters:** * ​count (`Int`): Number of elements to allocate memory for. * ​dtype (`DType`): The data type of each element. * ​alignment (`Int`): Address alignment of the allocated data. * ​address\_space (`AddressSpace`): The address space of the pointer. **Returns:** A data pointer of the given type pointing to the allocated space. `stack_allocation[count: Int, type: AnyType, /, name: Optional[StringSlice[StaticConstantOrigin]] = Optional(None), alignment: Int = alignof[::AnyType,__mlir_type.!kgen.target]() if is_gpu() else 1, address_space: AddressSpace = AddressSpace(0)]() -> UnsafePointer[type, address_space=address_space]` Allocates data buffer space on the stack given a data type and number of elements. **Parameters:** * ​count (`Int`): Number of elements to allocate memory for. * ​type (`AnyType`): The data type of each element. * ​name (`Optional[StringSlice[StaticConstantOrigin]]`): The name of the global variable (only honored in certain cases). * ​alignment (`Int`): Address alignment of the allocated data. * ​address\_space (`AddressSpace`): The address space of the pointer. **Returns:** A data pointer of the given type pointing to the allocated space. --- ## stack_allocation_like `stack_allocation_like[layout: Layout, dtype: DType, *, address_space: AddressSpace, target_address_space: AddressSpace = AddressSpace(0)](in_tensor: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, layout, MutableAnyOrigin, address_space=target_address_space, masked=masked]` Create a stack-allocated tensor with the same layout as an existing tensor. 
This function creates a new tensor on the stack with the same layout, data type, and masking properties as the input tensor, but potentially with a different address space. This is useful for creating temporary tensors that match the structure of existing tensors. Example:

```mojo
from layout import LayoutTensor, Layout
from layout.layout_tensor import stack_allocation_like

var global_tensor = LayoutTensor[
    DType.float32, Layout((10, 10)), address_space = AddressSpace.GLOBAL
]()
var stack_tensor = stack_allocation_like[
    target_address_space = AddressSpace.GENERIC
](global_tensor)
```

Performance: * Creates a tensor on the stack, which is typically faster to allocate and access than heap-allocated memory. * Stack allocations have automatic lifetime management, reducing memory management overhead. * Stack size is limited, so be cautious with large tensor allocations. Notes: * The new tensor will have the same layout, data type, and masking properties as the input tensor. * The address space can be changed, which is useful for moving data between different memory regions (e.g., from global to shared memory). * Stack allocations are automatically freed when they go out of scope. * The function uses the stack\_allocation method of the result tensor type. **Parameters:** * layout (`Layout`): The layout of the tensor to allocate. * dtype (`DType`): The data type of the tensor elements. * address\_space (`AddressSpace`): The address space of the input tensor. * target\_address\_space (`AddressSpace`): The address space for the new tensor. Defaults to GENERIC. **Args:** * in\_tensor (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to match the layout of. **Returns:** A new tensor allocated on the stack with the same layout as the input tensor. --- ## Start a chat endpoint The MAX framework simplifies the process to serve open source models with the same API interface as OpenAI. This allows you to replace commercial models with alternatives from the [MAX Builds](https://builds.modular.com/?category=models) site with minimal code changes. This tutorial shows you how to serve Llama 3.1 locally with the `max` CLI and interact with it through REST and Python APIs. You'll learn to configure the server and make requests using the OpenAI client libraries as a drop-in replacement. ## Set up your environment Create a Python project to install our APIs and CLI tools. ## Serve your model Use the [`max serve`](/max/max-cli/#serve) command to start a local model server with the Llama 3.1 model:

```bash
max serve \
  --model-path modularai/Llama-3.1-8B-Instruct-GGUF
```

While this example uses the Llama 3.1 model, you can replace it with any of the models listed in the [MAX Builds](https://builds.modular.com/?category=models) site. :::note When searching for a model using the MAX Builds site, ensure that the model can fit into the memory of your machine. You can filter and sort models by hardware type and model size.
For more information and to learn how to use the MAX Builds site, see [MAX Builds in 60 seconds](https://www.youtube.com/watch?v=EqM1TB1GgCc). ::: The server is ready when you see a message indicating it's running on http://0.0.0.0:8000:

```output
Server ready on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

For a complete list of `max` CLI commands and options, refer to the [MAX CLI reference](/max/max-cli). ## Interact with the model After the server is running, you can interact with the model using different methods. The MAX endpoint supports OpenAI REST APIs, so you can send requests from your client using the `openai` Python API. You can use OpenAI's Python client to interact with the model. To get started, install the OpenAI Python client:

```bash
pip install openai
```

Then, create a client and make a request to the model:

```python title="generate-text.py"
from openai import OpenAI

client = OpenAI(
    base_url='http://0.0.0.0:8000/v1',
    api_key='EMPTY',  # required by the API, but not used by MAX
)

response = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The LA Dodgers won in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

print(response.choices[0].message.content)
```

In this example, you're using the OpenAI Python client to interact with the MAX endpoint running on localhost port `8000`. The `client` object is initialized with the base URL `http://0.0.0.0:8000/v1` and the API key is ignored. When you run this code, the model should respond with information about the 2020 World Series location:

```sh
python generate-text.py
```

```output
The 2020 World Series was played at Globe Life Field in Arlington, Texas. It was a neutral site due to the COVID-19 pandemic.
```

The following `curl` command sends a simple chat request to the model's chat completions endpoint:

```bash
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ],
    "max_tokens": 100
  }'
```

You should receive a response similar to this:

```json
{
  "id": "18b0abd2d2fd463ea43efe2c147bcac0",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": " I'm doing well, thank you for asking. How can I assist you today?",
        "refusal": "",
        "tool_calls": null,
        "role": "assistant",
        "function_call": null
      },
      "logprobs": {
        "content": [],
        "refusal": []
      }
    }
  ],
  "created": 1743543698,
  "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
  "service_tier": null,
  "system_fingerprint": null,
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 17,
    "prompt_tokens": null,
    "total_tokens": 17
  }
}
```

For complete details on all available API endpoints and options, see the [MAX Serve API documentation](/max/api/serve). ## Next steps Now that you have successfully set up MAX with OpenAI-compatible endpoints, check out the other MAX tutorials. --- ## stat `stat[PathLike: PathLike](path: PathLike) -> stat_result` Get the status of a file or a file descriptor. **Parameters:** * PathLike (`PathLike`): A type conforming to the os.PathLike trait. **Args:** * path (`PathLike`): The path to the directory.
**Returns:**

The `stat_result` for the given path.

---

## stat

Implements the stat package.

## Modules

* [`stat`](/mojo/stdlib/stat/stat/): Implements the stat module.

---

## stat

Implements the stat module.

## Aliases

### `S_IFBLK`

`alias S_IFBLK = 24576`

Bits that determine the block device.

### `S_IFCHR`

`alias S_IFCHR = 8192`

Bits that determine the char device.

### `S_IFDIR`

`alias S_IFDIR = 16384`

Bits that determine the directory.

### `S_IFIFO`

`alias S_IFIFO = 4096`

Bits that determine the fifo.

### `S_IFLNK`

`alias S_IFLNK = 40960`

Bits that determine the symlink.

### `S_IFMT`

`alias S_IFMT = 61440`

Bits that determine the file type.

### `S_IFREG`

`alias S_IFREG = 32768`

Bits that determine the regular file.

### `S_IFSOCK`

`alias S_IFSOCK = 49152`

Bits that determine the socket.

## Functions

* [`S_ISBLK`](/mojo/stdlib/stat/stat/S_ISBLK): Returns True if the mode is a block device.
* [`S_ISCHR`](/mojo/stdlib/stat/stat/S_ISCHR): Returns True if the mode is a character device.
* [`S_ISDIR`](/mojo/stdlib/stat/stat/S_ISDIR): Returns True if the mode is a directory.
* [`S_ISFIFO`](/mojo/stdlib/stat/stat/S_ISFIFO): Returns True if the mode is a fifo.
* [`S_ISLNK`](/mojo/stdlib/stat/stat/S_ISLNK): Returns True if the mode is a symlink.
* [`S_ISREG`](/mojo/stdlib/stat/stat/S_ISREG): Returns True if the mode is a regular file.
* [`S_ISSOCK`](/mojo/stdlib/stat/stat/S_ISSOCK): Returns True if the mode is a socket.

---

## stat_result

`struct stat_result`

Object whose fields correspond to the members of the stat structure.

## Fields

* st\_mode (`Int`): File mode: file type and file mode bits (permissions).
* st\_ino (`Int`): Platform dependent, but if non-zero, uniquely identifies the file for a given value of st\_dev.
* st\_dev (`Int`): Identifier of the device on which this file resides.
* st\_nlink (`Int`): Number of hard links.
* st\_uid (`Int`): User identifier of the file owner.
* st\_gid (`Int`): Group identifier of the file owner.
* st\_size (`Int`): Size of the file in bytes, if it is a regular file or a symbolic link.
* st\_atimespec (`_CTimeSpec`): Time of file most recent access.
* st\_mtimespec (`_CTimeSpec`): Time of file most recent modification.
* st\_ctimespec (`_CTimeSpec`): Time of file most recent change.
* st\_birthtimespec (`_CTimeSpec`): Time of file creation.
* st\_blocks (`Int`): Number of 512-byte blocks allocated for file.
* st\_blksize (`Int`): Preferred blocksize for efficient file system I/O.
* st\_rdev (`Int`): Type of device if an inode device.
* st\_flags (`Int`): User defined flags for file.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__(out self, *, st_mode: Int, st_ino: Int, st_dev: Int, st_nlink: Int, st_uid: Int, st_gid: Int, st_size: Int, st_atimespec: _CTimeSpec, st_mtimespec: _CTimeSpec, st_ctimespec: _CTimeSpec, st_birthtimespec: _CTimeSpec, st_blocks: Int, st_blksize: Int, st_rdev: Int, st_flags: Int)`

Initialize the stat\_result structure.

**Args:**

* st\_mode (`Int`): File mode: file type and file mode bits (permissions).
* st\_ino (`Int`): Unique identifier for a file.
* st\_dev (`Int`): Identifier of the device on which this file resides.
* st\_nlink (`Int`): Number of hard links.
* st\_uid (`Int`): User identifier of the file owner.
* st\_gid (`Int`): Group identifier of the file owner.
* st\_size (`Int`): Size of the file (bytes), if it is a file or a symlink.
* st\_atimespec (`_CTimeSpec`): Time of file most recent access.
* st\_mtimespec (`_CTimeSpec`): Time of file most recent modification.
* st\_ctimespec (`_CTimeSpec`): Time of file most recent change.
* st\_birthtimespec (`_CTimeSpec`): Time of file creation.
* st\_blocks (`Int`): Number of 512-byte blocks allocated for file.
* st\_blksize (`Int`): Preferred blocksize for efficient file system I/O.
* st\_rdev (`Int`): Type of device if an inode device.
* st\_flags (`Int`): User defined flags for file.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this `stat_result` to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.

**Args:**

* writer (`W`): The object to write to.

### `__str__`

`__str__(self) -> String`

Constructs a string representation of stat\_result.

**Returns:**

A string representation of stat\_result.

### `__repr__`

`__repr__(self) -> String`

Constructs a representation of stat\_result.

**Returns:**

A representation of stat\_result.

---

## static

`static[d: Int]() -> ValueOrUnknown[d]`

Creates a static dimension with a compile-time value.

**Parameters:**

* d (`Int`): The compile-time dimension value to use.

**Returns:**

`ValueOrUnknown[d]` - A static dimension with the given value.

---

## static_tuple

Implements StaticTuple, a statically-sized uniform container.

You can import these APIs from the `utils` package. For example:

```mojo
from utils import StaticTuple
```

## Structs

* [`StaticTuple`](/mojo/stdlib/utils/static_tuple/StaticTuple): A statically sized tuple type which contains elements of homogeneous types.

---

## StaticInt

`@register_passable(trivial)`

`struct StaticInt[value: Int]`

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Intable`, `Movable`, `OptionallyStaticInt`, `UnknownDestructibility`

## Aliases

### `static_value`

`alias static_value = OptionalReg[Int]({:@stdlib::@builtin::@int::@Int value, 0})`

## Methods

### `__init__`

`__init__() -> Self`

### `__int__`

`__int__(self) -> Int`

### `as_uint32`

`as_uint32(self) -> SIMD[uint32, 1]`

---

## StaticTuple

`@register_passable(trivial)`

`struct StaticTuple[element_type: AnyTrivialRegType, size: Int]`

A statically sized tuple type which contains elements of homogeneous types.

## Parameters

* element\_type (`AnyTrivialRegType`): The type of the elements in the tuple.
* size (`Int`): The size of the tuple.

## Fields

* array (`array, element_type>`): The underlying storage for the static tuple.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Aliases

### `type`

`alias type = array, element_type>`

## Methods

### `__init__`

`__init__() -> Self`

Constructs an empty (undefined) tuple.

`@implicit`

`__init__(array: array, element_type>) -> Self`

Constructs from an array type.

**Args:**

* array (`array, element_type>`): Underlying MLIR array type.

`@implicit`

`__init__(*elems: element_type) -> Self`

Constructs a static tuple given a set of arguments.

**Args:**

* \*elems (`element_type`): The elements of the tuple.

`@implicit`

`__init__(values: VariadicList[element_type]) -> Self`

Creates a tuple constant using the specified values.

**Args:**

* values (`VariadicList[element_type]`): The list of values.

`__init__(*, other: Self) -> Self`

Explicitly copy the provided StaticTuple.

**Args:**

* other (`Self`): The StaticTuple to copy.

### `__getitem__`

`__getitem__[index: Int](self) -> element_type`

Returns the value of the tuple at the given index.
**Parameters:**

* index (`Int`): The index into the tuple.

**Returns:**

The value at the specified position.

`__getitem__[I: Indexer, //](self, idx: I) -> element_type`

Returns the value of the tuple at the given dynamic index.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* idx (`I`): The index into the tuple.

**Returns:**

The value at the specified position.

### `__setitem__`

`__setitem__[I: Indexer, //](mut self, idx: I, val: element_type)`

Stores a single value into the tuple at the specified dynamic index.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* idx (`I`): The index into the tuple.
* val (`element_type`): The value to store.

`__setitem__[idx: Int](mut self, val: element_type)`

Stores a single value into the tuple at the specified index.

**Parameters:**

* idx (`Int`): The index into the tuple.

**Args:**

* val (`element_type`): The value to store.

### `__len__`

`__len__(self) -> Int`

Returns the length of the tuple. This is a known constant value.

**Returns:**

The size of the tuple.

---

## store_matrix_d

`store_matrix_d[dtype: DType, //, m: Int, n: Int, k: Int, n_blocks: Int = 1](d_ptr: UnsafePointer[SIMD[dtype, 1]], d: SIMD[dtype, 4], tile_row: Int, tile_col: Int, ldm: Int)`

Stores a matrix D tile from registers to memory after a tensor core operation.

This function dispatches to architecture-specific implementations for storing the results of a tensor core matrix multiply-accumulate operation. It handles the different memory layouts required by NVIDIA and AMD tensor cores.

Note:

* Automatically selects the appropriate implementation based on GPU architecture.
* Each thread stores 4 elements in architecture-specific positions.
* Must be called by all threads in a warp.

**Parameters:**

* dtype (`DType`): Data type of the matrix elements.
* m (`Int`): Number of rows in matrix D.
* n (`Int`): Number of columns in matrix D.
* k (`Int`): Inner dimension for the matrix multiply.
* n\_blocks (`Int`): Number of blocks.

**Args:**

* d\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): Pointer to destination memory for matrix D.
* d (`SIMD[dtype, 4]`): SIMD vector containing the 4 elements to store.
* tile\_row (`Int`): Starting row index of the tile in matrix D.
* tile\_col (`Int`): Starting column index of the tile in matrix D.
* ldm (`Int`): Leading dimension (stride) of matrix D.

---

## store_release

`store_release[type: DType, //, scope: Scope = Scope(6), memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], value: SIMD[type, 1])`

Performs an atomic store with release memory ordering semantics.

This function provides a memory barrier that ensures all previous memory operations from the calling thread are visible to other threads before this store is performed.

Note:

* Only supported on GPUs.
* Maps directly to the PTX st.release instruction on NVIDIA, and to an LLVM atomic store on AMDGPU.
* Ensures all previous memory operations complete before this store.
* Critical for implementing synchronization primitives.

**Parameters:**

* type (`DType`): The data type to store.
* scope (`Scope`): Memory scope for the operation (default: `Scope.SYSTEM`).
* memory (`Bool`): Whether to include memory side effects in constraints (default: True).

**Args:**

* ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to store to.
* value (`SIMD[type, 1]`): Value to store.
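To make the release-ordering guarantee concrete, here is a minimal sketch of the publishing side of a flag/payload handoff. This is illustrative only: the import path is assumed, and per the note above this call is only supported in GPU code.

```mojo
from memory import UnsafePointer
from os.atomic import store_release  # assumed import path; adjust to your build

fn publish(data: UnsafePointer[Int32], flag: UnsafePointer[Int32]):
    # Plain store of the payload.
    data[0] = 42
    # Release store of the flag: a thread that observes flag == 1 with an
    # acquire load is guaranteed to also observe data[0] == 42.
    store_release(flag, Int32(1))
```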
---

## store_volatile

`store_volatile[type: DType, //, memory: Bool = True](ptr: UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], value: SIMD[type, 1])`

Performs a volatile store operation that cannot be optimized away.

This function guarantees that the store operation will be performed exactly as specified, without being reordered or optimized away by the compiler.

Note:

* Only supported on NVIDIA GPUs.
* Maps directly to the PTX st.volatile instruction.
* Prevents compiler optimization of the store operation.
* Useful for memory-mapped I/O or synchronization primitives.
* May have performance implications compared to regular stores.

**Parameters:**

* type (`DType`): The data type to store.
* memory (`Bool`): Whether to include memory side effects in constraints (default: True).

**Args:**

* ptr (`UnsafePointer[SIMD[type, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to the memory location to store to.
* value (`SIMD[type, 1]`): Value to store.

---

## store_x

`store_x[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)`

---

## store_y

`store_y[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)`

---

## store_z

`store_z[row_count: Int, type: DType](src: UnsafePointer[SIMD[type, 1]], start_index: Int)`

---

## str

Provides the `Stringable` and `StringableRaising` traits. These are Mojo built-ins, so you don't need to import them.

## Traits

* [`Stringable`](/mojo/stdlib/builtin/str/Stringable): The `Stringable` trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String).
* [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising): The StringableRaising trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String).

---

## Streaming multiprocessor

A streaming multiprocessor (SM) is the fundamental processing unit of a GPU, designed to execute multiple parallel workloads efficiently. Each SM contains several cores, which perform the actual computations of the [threads](thread.mdx) executing on the SM, along with shared resources like [registers](register.mdx), shared [memory](memory.mdx), and control mechanisms to coordinate the execution of threads.

The number of SMs and the number of cores per SM depend on the GPU's architecture. For example, the NVIDIA H100 PCIe contains 114 SMs, with 128 32-bit floating point cores per SM.

---

## strided_load

`strided_load[dtype: DType, //, simd_width: Int, *, invariant: Bool = False](addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], stride: Int, mask: SIMD[bool, simd_width] = SIMD(True)) -> SIMD[dtype, simd_width]`

Loads values from `addr` according to a specific stride.

**Parameters:**

* dtype (`DType`): DType of the values to load.
* simd\_width (`Int`): The width of the SIMD vectors.
* invariant (`Bool`): Whether the memory is load invariant.

**Args:**

* addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The memory location to load data from.
* stride (`Int`): How many lanes to skip before loading again.
* mask (`SIMD[bool, simd_width]`): A binary vector which prevents memory access to certain lanes of the result.

**Returns:**

A vector containing the loaded data.
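A small usage sketch for `strided_load`, assuming it is imported from `sys.intrinsics` (import path assumed):

```mojo
from memory import UnsafePointer
from sys.intrinsics import strided_load  # assumed import path

fn main():
    var buf = UnsafePointer[Float32].alloc(8)
    for i in range(8):
        buf[i] = Float32(i)

    # Gather buf[0], buf[2], buf[4], buf[6]: each lane is `stride`
    # elements past the previous one.
    var evens = strided_load[simd_width=4](buf, 2)
    print(evens)  # [0.0, 2.0, 4.0, 6.0]

    buf.free()
```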
---

## strided_store

`strided_store[dtype: DType, //, simd_width: Int](value: SIMD[dtype, simd_width], addr: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], stride: Int, mask: SIMD[bool, simd_width] = SIMD(True))`

Stores values to `addr` according to a specific stride.

**Parameters:**

* dtype (`DType`): DType of `value`, the value to store.
* simd\_width (`Int`): The width of the SIMD vectors.

**Args:**

* value (`SIMD[dtype, simd_width]`): The values to store.
* addr (`UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): The location to store values at.
* stride (`Int`): How many lanes to skip before storing again.
* mask (`SIMD[bool, simd_width]`): A binary vector which prevents memory access to certain lanes of `value`.

---

## string

The string package provides comprehensive Unicode string handling functionality for Mojo.

This package implements Unicode-aware string types and operations, with UTF-8 support. It includes efficient implementations for string manipulation, formatting, and Unicode operations while maintaining memory safety and performance.

Key Components:

* `String`: The main string type supporting UTF-8 encoded text
* `StringSlice`: Memory-efficient string view type for zero-copy operations
* `Codepoint`: Unicode code point handling and operations
* Format: String formatting and interpolation utilities

Core Features:

* Unicode support with UTF-8 encoding
* Efficient string slicing and views
* String formatting and interpolation
* Memory-safe string operations
* Unicode case conversion
* Unicode property lookups and validation

Example:

```mojo
# Basic string creation and manipulation
var s = String("Hello, 世界")
var slice = s[0:5]  # "Hello"

# Unicode-aware operations
for c in s.codepoints():
    print(c.to_uppercase())

# String formatting
var name = "Mojo"
var formatted = String("Hello, {}!").format(name)
```

Note: String stores data using UTF-8, and all operations (unless clearly noted) are intended to be fully Unicode compliant and maintain correct UTF-8 encoded data. A handful of operations are known to not be Unicode / UTF-8 compliant yet, but will be fixed as time permits.

## Modules

* [`codepoint`](/mojo/stdlib/collections/string/codepoint/): Unicode codepoint handling.
* [`format`](/mojo/stdlib/collections/string/format/): String formatting utilities for Mojo.
* [`string`](/mojo/stdlib/collections/string/string/): The core `String` type implementation for Mojo.
* [`string_slice`](/mojo/stdlib/collections/string/string_slice/): The `StringSlice` type implementation for efficient string operations.

---

## string

The core `String` type implementation for Mojo.

This module provides the primary `String` type and its fundamental operations. The `String` type is a mutable string, and is designed to handle UTF-8 encoded text efficiently while providing a safe and ergonomic interface for string manipulation.

Related types:

* [`StringSlice`](/mojo/stdlib/collections/string/string_slice/). A non-owning view of string data, which can be either mutable or immutable.
* [`StaticString`](/mojo/stdlib/collections/string/string_slice/#aliases). An alias for an immutable constant `StringSlice`.
* [`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral/). A string literal. String literals are compile-time values. For use at runtime, you usually want to wrap a `StringLiteral` in a `String` (for a mutable string) or `StaticString` (for an immutable constant string).
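For example, a minimal sketch of that distinction:

```mojo
fn main():
    # Wrap a literal in String when you need a mutable runtime value.
    var s = String("Hello")
    s += ", world"

    # Use StaticString for an immutable compile-time constant view.
    alias greeting: StaticString = "Hello"

    print(s)         # Hello, world
    print(greeting)  # Hello
```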
Key Features: * Short string optimization (SSO) and lazy copying of constant string data. * O(1) copy operation. * Memory-safe string operations. * Efficient string concatenation and slicing. * String-to-number conversions ( [`atof()`](/mojo/stdlib/collections/string/string/atof), [`atol()`](/mojo/stdlib/collections/string/string/atol)). * Character code conversions ( [`chr()`](/mojo/stdlib/collections/string/string/chr), [`ord()`](/mojo/stdlib/collections/string/string/ord)). * String formatting with [`format()`](/mojo/stdlib/collections/string/string/String/#format). The `String` type has Unicode support through UTF-8 encoding. A handful of operations are known to not be Unicode / UTF-8 compliant yet, but will be fixed as time permits. This type is in the prelude, so it is automatically imported into every Mojo program. Example: ```mojo # String creation and basic operations var s1 = String("Hello") var s2 = String("World") var combined = s1 + " " + s2 # "Hello World" # String-to-number conversion var num = atof("3.14") var int_val = atol("42") # Character operations var char = chr(65) # "A" var code = ord("A") # 65 # String formatting print(String("Codepoint {} is {}").format(code, char)) # Codepoint 65 is A # ASCII utilities var ascii_str = ascii("Hello") # ASCII-only string ``` ## Structs * [​`String`](/mojo/stdlib/collections/string/string/String): Represents a mutable string. ## Functions * [​`ascii`](/mojo/stdlib/collections/string/string/ascii): Get the ASCII representation of the object. * [​`atof`](/mojo/stdlib/collections/string/string/atof): Parses the given string as a floating point and returns that value. * [​`atol`](/mojo/stdlib/collections/string/string/atol): Parses and returns the given string as an integer in the given base. * [​`chr`](/mojo/stdlib/collections/string/string/chr): Returns a String based on the given Unicode code point. This is the inverse of the `ord()` function. * [​`ord`](/mojo/stdlib/collections/string/string/ord): Returns an integer that represents the codepoint of a single-character string. --- ## String `struct String` Represents a mutable string. See the [`string` module](/mojo/stdlib/collections/string/string/) for more information and examples. 
## Implemented traits

`AnyType`, `Boolable`, `Comparable`, `ConvertibleFromPython`, `Copyable`, `Defaultable`, `EqualityComparable`, `ExplicitlyCopyable`, `FloatableRaising`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `IntableRaising`, `KeyElement`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `TypeIdentifiable`, `UnknownDestructibility`, `Writable`, `Writer`, `_HashableWithHasher`

## Aliases

### `ASCII_LETTERS`

`alias ASCII_LETTERS = "abcdefghijklmnopqrstuvwxyz".__add__[__mlir_type.!kgen.string]("ABCDEFGHIJKLMNOPQRSTUVWXYZ")`

### `ASCII_LOWERCASE`

`alias ASCII_LOWERCASE = "abcdefghijklmnopqrstuvwxyz"`

### `ASCII_UPPERCASE`

`alias ASCII_UPPERCASE = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"`

### `DIGITS`

`alias DIGITS = "0123456789"`

### `HEX_DIGITS`

`alias HEX_DIGITS = "0123456789".__add__[__mlir_type.!kgen.string]("abcdef").__add__[__mlir_type.!kgen.string]("ABCDEF")`

### `OCT_DIGITS`

`alias OCT_DIGITS = "01234567"`

### `PRINTABLE`

`` alias PRINTABLE = "0123456789".__add__[__mlir_type.!kgen.string]("abcdefghijklmnopqrstuvwxyz".__add__[__mlir_type.!kgen.string]("ABCDEFGHIJKLMNOPQRSTUVWXYZ")).__add__[__mlir_type.!kgen.string]("!\22#$%&'()*+,-./:;?@[\\]^_`{|}\~").__add__[__mlir_type.!kgen.string](" \t\n\r\v\f") ``

### `PUNCTUATION`

`` alias PUNCTUATION = "!\22#$%&'()*+,-./:;?@[\\]^_`{|}\~" ``

### `TYPE_ID`

`alias TYPE_ID = "stdlib.String"`

## Methods

### `__init__`

`__init__(out self)`

Construct an empty string.

`__init__(out self, *, capacity: Int)`

Construct an empty string with a given capacity.

**Args:**

* capacity (`Int`): The capacity of the string to allocate.

`@implicit`

`__init__(out self, data: StringSlice[StaticConstantOrigin])`

Construct a string from a static constant string without allocating.

**Args:**

* data (`StringSlice[StaticConstantOrigin]`): The static constant string to refer to.

`@implicit`

`__init__(out self, data: StringLiteral[value])`

Construct a string from a string literal without allocating.

**Args:**

* data (`StringLiteral[value]`): The static constant string to refer to.

`__init__(out self, *, bytes: Span[SIMD[uint8, 1], origin])`

Construct a string by copying the data. This constructor is explicit because it can involve memory allocation.

**Args:**

* bytes (`Span[SIMD[uint8, 1], origin]`): The bytes to copy.

`__init__[T: Stringable](out self, value: T)`

Initialize from a type conforming to `Stringable`.

**Parameters:**

* T (`Stringable`): The type conforming to Stringable.

**Args:**

* value (`T`): The object to get the string representation of.

`__init__[T: StringableRaising](out self, value: T)`

Initialize from a type conforming to `StringableRaising`.

**Parameters:**

* T (`StringableRaising`): The type conforming to Stringable.

**Args:**

* value (`T`): The object to get the string representation of.

**Raises:**

If there is an error when computing the string representation of the type.

`__init__[*Ts: Writable](out self, *args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""))`

Construct a string by concatenating a sequence of Writable arguments.

Examples:

Construct a String from several `Writable` arguments:

```mojo
var string = String(1, 2.0, "three", sep=", ")
print(string)  # "1, 2.0, three"
```

**Parameters:**

* \*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`.
**Args:**

* \*args (`*Ts`): A sequence of Writable arguments.
* sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements.
* end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements.

`__init__[*Ts: Writable](out self, args: VariadicPack[is_owned, origin, Writable, Ts], sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""))`

Construct a string by passing a variadic pack.

Examples:

```mojo
fn variadic_pack_to_string[
    *Ts: Writable,
](*args: *Ts) -> String:
    return String(args)

string = variadic_pack_to_string(1, ", ", 2.0, ", ", "three")
```

**Parameters:**

* \*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`.

**Args:**

* args (`VariadicPack[is_owned, origin, Writable, Ts]`): A VariadicPack of Writable arguments.
* sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements.
* end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements.

`__init__(out self, *, unsafe_uninit_length: UInt)`

Construct a String with the specified length, with uninitialized memory. This is unsafe, as it relies on the caller initializing the elements with unsafe operations, not assigning over the uninitialized data.

**Args:**

* unsafe\_uninit\_length (`UInt`): The number of bytes to allocate.

`__init__(out self, *, unsafe_from_utf8_ptr: UnsafePointer[SIMD[int8, 1], mut=mut, origin=origin])`

Creates a string from a UTF-8 encoded nul-terminated pointer.

Safety:

* `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data.
* `unsafe_from_utf8_ptr` MUST be null terminated.

**Args:**

* unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[int8, 1], mut=mut, origin=origin]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8.

`__init__(out self, *, unsafe_from_utf8_ptr: UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin])`

Creates a string from a UTF-8 encoded nul-terminated pointer.

Safety:

* `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data.
* `unsafe_from_utf8_ptr` MUST be null terminated.

**Args:**

* unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8.

`__init__(out self, obj: PythonObject)`

Construct a `String` from a PythonObject.

**Args:**

* obj (`PythonObject`): The PythonObject to convert from.

**Raises:**

An error if the conversion failed.

### `__copyinit__`

`__copyinit__(out self, other: Self)`

Copy initialize the string from another string.

**Args:**

* other (`Self`): The string to copy.

### `__moveinit__`

`__moveinit__(out self, owned other: Self)`

Move initialize the string from another string.

**Args:**

* other (`Self`): The string to move.

### `__del__`

`__del__(owned self)`

Destroy the string data.

### `__bool__`

`__bool__(self) -> Bool`

Checks if the string is not empty.

**Returns:**

True if the string length is greater than zero, and False otherwise.

### `__getitem__`

`__getitem__[I: Indexer](self, idx: I) -> Self`

Gets the character at the specified position.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* idx (`I`): The index value.

**Returns:**

A new string containing the character at the specified position.

`__getitem__(self, span: Slice) -> Self`

Gets the sequence of characters at the specified positions.

**Args:**

* span (`Slice`): A slice that specifies positions of the new substring.
**Returns:**

A new string containing the substring at the specified positions.

### `__lt__`

`__lt__(self, rhs: Self) -> Bool`

Compare this String to the RHS using LT comparison.

**Args:**

* rhs (`Self`): The other String to compare against.

**Returns:**

True if this String is strictly less than the RHS String and False otherwise.

### `__le__`

`__le__(self, rhs: Self) -> Bool`

Compare this String to the RHS using LE comparison.

**Args:**

* rhs (`Self`): The other String to compare against.

**Returns:**

True iff this String is less than or equal to the RHS String.

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Compares two Strings if they have the same values.

**Args:**

* other (`Self`): The rhs of the operation.

**Returns:**

True if the Strings are equal and False otherwise.

`__eq__(self, other: StringSlice[origin]) -> Bool`

Compares two Strings if they have the same values.

**Args:**

* other (`StringSlice[origin]`): The rhs of the operation.

**Returns:**

True if the Strings are equal and False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Compares two Strings if they do not have the same values.

**Args:**

* other (`Self`): The rhs of the operation.

**Returns:**

True if the Strings are not equal and False otherwise.

`__ne__(self, other: StringSlice[origin]) -> Bool`

Compares two Strings if they do not have the same values.

**Args:**

* other (`StringSlice[origin]`): The rhs of the operation.

**Returns:**

True if the Strings are not equal and False otherwise.

### `__gt__`

`__gt__(self, rhs: Self) -> Bool`

Compare this String to the RHS using GT comparison.

**Args:**

* rhs (`Self`): The other String to compare against.

**Returns:**

True iff this String is strictly greater than the RHS String.

### `__ge__`

`__ge__(self, rhs: Self) -> Bool`

Compare this String to the RHS using GE comparison.

**Args:**

* rhs (`Self`): The other String to compare against.

**Returns:**

True iff this String is greater than or equal to the RHS String.

### `__contains__`

`__contains__(self, substr: StringSlice[origin]) -> Bool`

Returns True if the substring is contained within the current string.

**Args:**

* substr (`StringSlice[origin]`): The substring to check.

**Returns:**

True if the string contains the substring.

### `__add__`

`__add__(self, other: StringSlice[origin]) -> Self`

Creates a string by appending a string slice at the end.

**Args:**

* other (`StringSlice[origin]`): The string slice to append.

**Returns:**

The new constructed string.

### `__mul__`

`__mul__(self, n: Int) -> Self`

Concatenates the string `n` times.

**Args:**

* n (`Int`): The number of times to concatenate the string.

**Returns:**

The string concatenated `n` times.

### `__radd__`

`__radd__(self, other: StringSlice[origin]) -> Self`

Creates a string by prepending another string slice to the start.

**Args:**

* other (`StringSlice[origin]`): The string to prepend.

**Returns:**

The new constructed string.

### `__iadd__`

`__iadd__(mut self, other: StringSlice[origin])`

Appends another string slice to this string.

**Args:**

* other (`StringSlice[origin]`): The string to append.

### `copy`

`copy(self) -> Self`

Explicitly copy the provided value.

**Returns:**

A copy of the value.

### `capacity`

`capacity(self) -> UInt`

Get the capacity of the string.

**Returns:**

The capacity of the string.

### `write_bytes`

`write_bytes(mut self, bytes: Span[SIMD[uint8, 1], origin])`

Write a byte span to this String.

**Args:**

* bytes (`Span[SIMD[uint8, 1], origin]`): The byte span to write to this String. Must NOT be null terminated.
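For instance, a minimal usage sketch:

```mojo
fn main():
    var s = String("Hello")
    var tail = String(", world")
    # Append the raw UTF-8 bytes of another string. The span returned by
    # as_bytes() is not null-terminated, as write_bytes() requires.
    s.write_bytes(tail.as_bytes())
    print(s)  # Hello, world
```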
### `write`

`write[*Ts: Writable](mut self, *args: *Ts)`

Write a sequence of Writable arguments to the provided Writer.

**Parameters:**

* \*Ts (`Writable`): Types of the provided argument sequence.

**Args:**

* \*args (`*Ts`): Sequence of arguments to write to this Writer.

`static write[*Ts: Writable](*args: *Ts, *, sep: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](""), end: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("")) -> Self`

Construct a string by concatenating a sequence of Writable arguments.

This is used when reusing the `write_to` method for `__str__`, in order to avoid an endless loop of calls back into the constructor:

```mojo
fn write_to[W: Writer](self, mut writer: W):
    writer.write_bytes(self.as_bytes())

fn __str__(self) -> String:
    return String.write(self)
```

Otherwise you can use the `String` constructor directly without calling the `String.write` static method:

```mojo
var msg = String("my message", 42, 42.2, True)
```

**Parameters:**

* \*Ts (`Writable`): The types of the arguments to format. Each type must satisfy `Writable`.

**Args:**

* \*args (`*Ts`): A sequence of Writable arguments.
* sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements.
* end (`StringSlice[StaticConstantOrigin]`): The String to write after printing the elements.

**Returns:**

A string formed by formatting the argument sequence.

### `append_byte`

`append_byte(mut self, byte: SIMD[uint8, 1])`

Append a byte to the string.

**Args:**

* byte (`SIMD[uint8, 1]`): The byte to append.

### `__iter__`

`__iter__(self) -> CodepointSliceIter[self]`

Iterate over the string, returning immutable references.

**Returns:**

An iterator of references to the string elements.

### `__reversed__`

`__reversed__(self) -> CodepointSliceIter[self, False]`

Iterate backwards over the string, returning immutable references.

**Returns:**

A reversed iterator of references to the string elements.

### `__len__`

`__len__(self) -> Int`

Get the string length in bytes.

This function returns the number of bytes in the underlying UTF-8 representation of the string. To get the number of Unicode codepoints in a string, use `len(str.codepoints())`.

# Examples

Query the length of a string, in bytes and Unicode codepoints:

```mojo
from testing import assert_equal

var s = String("ನಮಸ್ಕಾರ")

assert_equal(len(s), 21)
assert_equal(len(s.codepoints()), 7)
```

Strings containing only ASCII characters have the same byte and Unicode codepoint length:

```mojo
from testing import assert_equal

var s = String("abc")

assert_equal(len(s), 3)
assert_equal(len(s.codepoints()), 3)
```

**Returns:**

The string length in bytes.

### `__str__`

`__str__(self) -> Self`

Gets the string itself. This method ensures that you can pass a `String` to a method that takes a `Stringable` value.

**Returns:**

The string itself.

### `__repr__`

`__repr__(self) -> Self`

Return a Mojo-compatible representation of the `String` instance.

**Returns:**

A new representation of the string.

### `__fspath__`

`__fspath__(self) -> Self`

Return the file system path representation (just the string itself).

**Returns:**

The file system path representation as a string.

### `to_python_object`

`to_python_object(self) -> PythonObject`

Convert this value to a PythonObject.

**Returns:**

A PythonObject representing the value.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this string to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.
**Args:** * ​writer (`W`): The object to write to. ### `join` `join[*Ts: Writable](self, *elems: *Ts) -> Self` Joins string elements using the current string as a delimiter. **Parameters:** * ​\*Ts (`Writable`): The types of the elements. **Args:** * ​\*elems (`*Ts`): The input values. **Returns:** The joined string. `join[T: Copyable & Movable & Writable, //, buffer_size: Int = 4096](self, elems: List[T, hint_trivial_type]) -> Self` Joins string elements using the current string as a delimiter. Defaults to writing to the stack if total bytes of `elems` is less than `buffer_size`, otherwise will allocate once to the heap and write directly into that. The `buffer_size` defaults to 4096 bytes to match the default page size on arm64 and x86-64, but you can increase this if you're joining a very large `List` of elements to write into the stack instead of the heap. **Parameters:** * ​T (`Copyable & Movable & Writable`): The type of the elements. Must implement the `Copyable`, `Movable` and `Writable` traits. * ​buffer\_size (`Int`): The max size of the stack buffer. **Args:** * ​elems (`List[T, hint_trivial_type]`): The input values. **Returns:** The joined string. ### `codepoints` `codepoints(self) -> CodepointsIter[self]` Returns an iterator over the `Codepoint`s encoded in this string slice. # Examples Print the characters in a string: ```mojo from testing import assert_equal var s = String("abc") var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) assert_equal(iter.__next__(), Codepoint.ord("b")) assert_equal(iter.__next__(), Codepoint.ord("c")) assert_equal(iter.__has_next__(), False) ``` `codepoints()` iterates over Unicode codepoints, and supports multibyte codepoints: ```mojo from testing import assert_equal # A visual character composed of a combining sequence of 2 codepoints. var s = String("á") assert_equal(s.byte_length(), 3) var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) # U+0301 Combining Acute Accent assert_equal(iter.__next__().to_u32(), 0x0301) assert_equal(iter.__has_next__(), False) ``` . **Returns:** An iterator type that returns successive `Codepoint` values stored in this string slice. ### `codepoint_slices` `codepoint_slices(self) -> CodepointSliceIter[self]` Returns an iterator over single-character slices of this string. Each returned slice points to a single Unicode codepoint encoded in the underlying UTF-8 representation of this string. # Examples Iterate over the character slices in a string: ```mojo from testing import assert_equal, assert_true var s = String("abc") var iter = s.codepoint_slices() assert_true(iter.__next__() == "a") assert_true(iter.__next__() == "b") assert_true(iter.__next__() == "c") assert_equal(iter.__has_next__(), False) ``` . **Returns:** An iterator of references to the string elements. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=False, origin=self]` Retrieves a pointer to the underlying memory. **Returns:** The pointer to the underlying memory. ### `unsafe_ptr_mut` `unsafe_ptr_mut(mut self) -> UnsafePointer[SIMD[uint8, 1], origin=self]` Retrieves a mutable pointer to the underlying memory, copying to a new buffer if this was previously pointing to a static constant. **Returns:** The pointer to the underlying memory. ### `unsafe_cstr_ptr` `unsafe_cstr_ptr(mut self) -> UnsafePointer[SIMD[int8, 1], origin=self]` Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be null, or NUL terminated. 
**Returns:**

The pointer to the underlying memory.

### `as_bytes`

`as_bytes(self) -> Span[SIMD[uint8, 1], self]`

Returns a contiguous slice of the bytes owned by this string.

**Returns:**

A contiguous slice pointing to the bytes owned by this string.

### `as_bytes_mut`

`as_bytes_mut(mut self) -> Span[SIMD[uint8, 1], self]`

Returns a mutable contiguous slice of the bytes owned by this string. This name has a \_mut suffix so the as\_bytes() method doesn't have to guarantee mutability.

**Returns:**

A contiguous slice pointing to the bytes owned by this string.

### `as_string_slice`

`as_string_slice(self) -> StringSlice[self]`

Returns a string slice of the data owned by this string.

**Returns:**

A string slice pointing to the data owned by this string.

### `as_string_slice_mut`

`as_string_slice_mut(mut self) -> StringSlice[self]`

Returns a mutable string slice of the data owned by this string.

**Returns:**

A string slice pointing to the data owned by this string.

### `byte_length`

`byte_length(self) -> Int`

Get the string length in bytes.

**Returns:**

The length of this string in bytes.

### `count`

`count(self, substr: StringSlice[origin]) -> Int`

Return the number of non-overlapping occurrences of substring `substr` in the string. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one.

**Args:**

* substr (`StringSlice[origin]`): The substring to count.

**Returns:**

The number of occurrences of `substr`.

### `find`

`find(self, substr: StringSlice[origin], start: Int = 0) -> Int`

Finds the offset of the first occurrence of `substr` starting at `start`. If not found, returns -1.

**Args:**

* substr (`StringSlice[origin]`): The substring to find.
* start (`Int`): The offset from which to find.

**Returns:**

The offset of `substr` relative to the beginning of the string.

### `rfind`

`rfind(self, substr: StringSlice[origin], start: Int = 0) -> Int`

Finds the offset of the last occurrence of `substr` starting at `start`. If not found, returns -1.

**Args:**

* substr (`StringSlice[origin]`): The substring to find.
* start (`Int`): The offset from which to find.

**Returns:**

The offset of `substr` relative to the beginning of the string.

### `isspace`

`isspace(self) -> Bool`

Determines whether every character in the string is a Python whitespace character. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines): `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`.

**Returns:**

True if the whole String is made up of the whitespace characters listed above, otherwise False.

### `split`

`split(self, sep: StringSlice[origin], maxsplit: Int = -1) -> List[String]`

Split the string by a separator.

Examples:

```mojo
# Splitting a space
_ = String("hello world").split(" ")  # ["hello", "world"]
# Splitting adjacent separators
_ = String("hello,,world").split(",")  # ["hello", "", "world"]
# Splitting with maxsplit
_ = String("1,2,3").split(",", 1)  # ['1', '2,3']
```

**Args:**

* sep (`StringSlice[origin]`): The string to split on.
* maxsplit (`Int`): The maximum amount of items to split from String. Defaults to unlimited.

**Returns:**

A List of Strings containing the input split by the separator.

**Raises:**

If the separator is empty.

`split(self, sep: NoneType = NoneType(None), maxsplit: Int = -1) -> List[String]`

Split the string by every whitespace separator.
Examples:

```mojo
# Splitting an empty string or filled with whitespaces
_ = String("      ").split()  # []
_ = String("").split()  # []

# Splitting a string with leading, trailing, and middle whitespaces
_ = String("      hello    world     ").split()  # ["hello", "world"]
# Splitting adjacent universal newlines:
_ = String(
    "hello \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029world"
).split()  # ["hello", "world"]
```

**Args:**

* sep (`NoneType`): None.
* maxsplit (`Int`): The maximum amount of items to split from String. Defaults to unlimited.

**Returns:**

A List of Strings containing the input split by the separator.

### `splitlines`

`splitlines(self, keepends: Bool = False) -> List[String]`

Split the string at line boundaries. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`.

**Args:**

* keepends (`Bool`): If True, line breaks are kept in the resulting strings.

**Returns:**

A List of Strings containing the input split by line boundaries.

### `replace`

`replace(self, old: StringSlice[origin], new: StringSlice[origin]) -> Self`

Return a copy of the string with all occurrences of substring `old` replaced by `new`.

**Args:**

* old (`StringSlice[origin]`): The substring to replace.
* new (`StringSlice[origin]`): The substring to replace with.

**Returns:**

The string where all occurrences of `old` are replaced with `new`.

### `strip`

`strip(self, chars: StringSlice[origin]) -> StringSlice[self]`

Return a copy of the string with leading and trailing characters removed.

**Args:**

* chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace.

**Returns:**

A copy of the string with no leading or trailing characters.

`strip(self) -> StringSlice[self]`

Return a copy of the string with leading and trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`.

**Returns:**

A copy of the string with no leading or trailing whitespaces.

### `rstrip`

`rstrip(self, chars: StringSlice[origin]) -> StringSlice[self]`

Return a copy of the string with trailing characters removed.

**Args:**

* chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace.

**Returns:**

A copy of the string with no trailing characters.

`rstrip(self) -> StringSlice[self]`

Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`.

**Returns:**

A copy of the string with no trailing whitespaces.

### `lstrip`

`lstrip(self, chars: StringSlice[origin]) -> StringSlice[self]`

Return a copy of the string with leading characters removed.

**Args:**

* chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace.

**Returns:**

A copy of the string with no leading characters.

`lstrip(self) -> StringSlice[self]`

Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`.

**Returns:**

A copy of the string with no leading whitespaces.

### `__hash__`

`__hash__(self) -> UInt`

Hash the underlying buffer using builtin hash.

**Returns:**

A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details.

`__hash__[H: _Hasher](self, mut hasher: H)`

Updates hasher with the underlying bytes.

**Parameters:**

* H (`_Hasher`): The hasher type.
**Args:**

* hasher (`H`): The hasher instance.

### `lower`

`lower(self) -> Self`

Returns a copy of the string with all cased characters converted to lowercase.

**Returns:**

A new string where cased letters have been converted to lowercase.

### `upper`

`upper(self) -> Self`

Returns a copy of the string with all cased characters converted to uppercase.

**Returns:**

A new string where cased letters have been converted to uppercase.

### `startswith`

`startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool`

Checks if the string starts with the specified prefix between start and end positions. Returns True if found and False otherwise.

**Args:**

* prefix (`StringSlice[origin]`): The prefix to check.
* start (`Int`): The start offset from which to check.
* end (`Int`): The end offset from which to check.

**Returns:**

True if `self[start:end]` is prefixed by the input prefix.

### `endswith`

`endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool`

Checks if the string ends with the specified suffix between start and end positions. Returns True if found and False otherwise.

**Args:**

* suffix (`StringSlice[origin]`): The suffix to check.
* start (`Int`): The start offset from which to check.
* end (`Int`): The end offset from which to check.

**Returns:**

True if `self[start:end]` is suffixed by the input suffix.

### `removeprefix`

`removeprefix(self, prefix: StringSlice[origin], /) -> StringSlice[self]`

Returns a new string with the prefix removed if it was present.

Examples:

```mojo
print(String('TestHook').removeprefix('Test'))  # 'Hook'
print(String('BaseTestCase').removeprefix('Test'))  # 'BaseTestCase'
```

**Args:**

* prefix (`StringSlice[origin]`): The prefix to remove from the string.

**Returns:**

`string[len(prefix):]` if the string starts with the prefix string, or a copy of the original string otherwise.

### `removesuffix`

`removesuffix(self, suffix: StringSlice[origin], /) -> StringSlice[self]`

Returns a new string with the suffix removed if it was present.

Examples:

```mojo
print(String('TestHook').removesuffix('Hook'))  # 'Test'
print(String('BaseTestCase').removesuffix('Test'))  # 'BaseTestCase'
```

**Args:**

* suffix (`StringSlice[origin]`): The suffix to remove from the string.

**Returns:**

`string[:-len(suffix)]` if the string ends with the suffix string, or a copy of the original string otherwise.

### `__int__`

`__int__(self) -> Int`

Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised.

**Returns:**

An integer value that represents the string, or otherwise raises.

### `__float__`

`__float__(self) -> SIMD[float64, 1]`

Parses the string as a floating-point number and returns that value. If the string cannot be parsed as a float, an error is raised.

**Returns:**

A float value that represents the string, or otherwise raises.

### `format`

`format[*Ts: Stringable & Representable](self, *args: *Ts) -> Self`

Produce a formatted string using the current string as a template.

The template, or "format string", can contain literal text and/or replacement fields delimited with curly braces (`{}`). Returns a copy of the format string with the replacement fields replaced with string representations of the `args` arguments.

For more information, see the discussion in the [`format` module](/mojo/stdlib/collections/string/format/).
Example:

```mojo
# Manual indexing:
print(String("{0} {1} {0}").format("Mojo", 1.125))  # Mojo 1.125 Mojo
# Automatic indexing:
print(String("{} {}").format(True, "hello world"))  # True hello world
```

**Parameters:**

* \*Ts (`Stringable & Representable`): The types of substitution values that implement `Representable` and `Stringable` (to be changed and made more flexible).

**Args:**

* \*args (`*Ts`): The substitution values.

**Returns:**

The template with the given values substituted.

### `isdigit`

`isdigit(self) -> Bool`

A string is a digit string if all characters in the string are digits and there is at least one character in the string. Note that this currently only works with ASCII strings.

**Returns:**

True if all characters are digits and the string is not empty, False otherwise.

### `isupper`

`isupper(self) -> Bool`

Returns True if all cased characters in the string are uppercase and there is at least one cased character.

**Returns:**

True if all cased characters in the string are uppercase and there is at least one cased character, False otherwise.

### `islower`

`islower(self) -> Bool`

Returns True if all cased characters in the string are lowercase and there is at least one cased character.

**Returns:**

True if all cased characters in the string are lowercase and there is at least one cased character, False otherwise.

### `isprintable`

`isprintable(self) -> Bool`

Returns True if all characters in the string are ASCII printable. Note that this currently only works with ASCII strings.

**Returns:**

True if all characters are printable, False otherwise.

### `rjust`

`rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self`

Returns the string right justified in a string of specified width.

**Args:**

* width (`Int`): The width of the field containing the string.
* fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character.

**Returns:**

The right-justified string, or `self` if `width` is not greater than the string's length.

### `ljust`

`ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self`

Returns the string left justified in a string of specified width.

**Args:**

* width (`Int`): The width of the field containing the string.
* fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character.

**Returns:**

The left-justified string, or `self` if `width` is not greater than the string's length.

### `center`

`center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> Self`

Returns the string center justified in a string of specified width.

**Args:**

* width (`Int`): The width of the field containing the string.
* fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character.

**Returns:**

The center-justified string, or `self` if `width` is not greater than the string's length.

### `resize`

`resize(mut self, length: Int, fill_byte: SIMD[uint8, 1] = __init__[__mlir_type.!pop.int_literal](0))`

Resize the string to a new length.

Notes: If the new length is greater than the current length, the string is extended by the difference, and the new bytes are initialized to `fill_byte`.

**Args:**

* length (`Int`): The new length of the string.
* fill\_byte (`SIMD[uint8, 1]`): The byte to fill any new space with.

`resize(mut self, *, unsafe_uninit_length: Int)`

Resizes the string to the given new size, leaving any new data uninitialized.
If the new size is smaller than the current one, elements at the end are discarded. If the new size is larger than the current one, the string is extended and the new data is left uninitialized. **Args:** * ​unsafe\_uninit\_length (`Int`): The new size. ### `reserve` `reserve(mut self, new_capacity: UInt)` Reserves the requested capacity. Notes: If the current capacity is greater or equal, this is a no-op. Otherwise, the storage is reallocated and the data is moved. **Args:** * ​new\_capacity (`UInt`): The new capacity in stored bytes. --- ## string_literal Implements the StringLiteral struct. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`StringLiteral`](/mojo/stdlib/builtin/string_literal/StringLiteral): This type represents a string literal. --- ## string_slice The `StringSlice` type implementation for efficient string operations. This module provides the `StringSlice` type, which is a lightweight view into string data that enables zero-copy string operations. `StringSlice` is designed for high-performance string manipulation while maintaining memory safety and UTF-8 awareness. The `StringSlice` type is particularly useful for: * High-performance string operations without copying. * Efficient string parsing and tokenization. `StaticString` is an alias for an immutable constant `StringSlice`. `StringSlice` and `StaticString` are in the prelude, so they are automatically imported into every Mojo program. Example: ```mojo # Create a string slice var text = StringSlice("Hello, 世界") # Zero-copy slicing var hello = text[0:5] # Hello # Unicode-aware operations var world = text[7:13] # "世界" # String comparison if text.startswith("Hello"): print("Found greeting") # String formatting var format_string = StaticString("{}: {}") print(format_string.format("bats", 6)) # bats: 6 ``` ## Aliases ### `StaticString` `alias StaticString = StringSlice[StaticConstantOrigin]` An immutable static string slice. ## Structs * [​`CodepointsIter`](/mojo/stdlib/collections/string/string_slice/CodepointsIter): Iterator over the `Codepoint`s in a string slice, constructed by `StringSlice.codepoints()`. * [​`CodepointSliceIter`](/mojo/stdlib/collections/string/string_slice/CodepointSliceIter): Iterator for `StringSlice` over substring slices containing a single Unicode codepoint. * [​`StringSlice`](/mojo/stdlib/collections/string/string_slice/StringSlice): A non-owning view to encoded string data. ## Functions * [​`get_static_string`](/mojo/stdlib/collections/string/string_slice/get_static_string): Form a StaticString from compile-time StringSlice values. This guarantees that the returned string is compile-time constant in static memory. It also guarantees that there is a 'nul' zero byte at the end, which is not included in the returned range. --- ## Stringable The `Stringable` trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String). Any type that conforms to `Stringable` or [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising) works with the built-in [`print()`](/mojo/stdlib/builtin/io/print) and [`String()`](/mojo/stdlib/builtin/str/str) functions. The `Stringable` trait requires the type to define the `__str__()` method. 
For example: ```mojo struct Foo(Stringable): var s: String fn __str__(self) -> String: return self.s ``` Now you can pass an instance of `Foo` to the `String()` function to get back a `String`: ```mojo var foo = Foo("test") print(String(foo) == "test") ``` ```plaintext True ``` **Note:** If the `__str__()` method might raise an error, use the [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising) trait, instead. About the difference between `__repr__()` and `__str__()`: The method `__repr__` computes the "official" string representation of an object while `__str__` computes the "informal" or nicely printable string representation of an object. This method differs from `__repr__()` in that there is no expectation that `__str__()` return a valid Mojo expression: a more convenient or concise representation can be used. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__str__` `__str__(self: _Self) -> String` Get the string representation of the type. **Returns:** The string representation of the type. --- ## StringableRaising The StringableRaising trait describes a type that can be converted to a [`String`](/mojo/stdlib/collections/string/String). Any type that conforms to [`Stringable`](/mojo/stdlib/builtin/str/Stringable) or `StringableRaising` works with the built-in [`print()`](/mojo/stdlib/builtin/io/print) and [`String()`](/mojo/stdlib/builtin/str/str) functions. The `StringableRaising` trait requires the type to define the `__str__()` method, which can raise an error. For example: ```mojo struct Foo(StringableRaising): var s: String fn __str__(self) raises -> String: if self.s == "": raise Error("Empty String") return self.s ``` Now you can pass an instance of `Foo` to the `String()` function to get back a `String`: ```mojo fn main() raises: var foo = Foo("test") print(String(foo) == "test") ``` ```plaintext True ``` ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__str__` `__str__(self: _Self) -> String` Get the string representation of the type. **Returns:** The string representation of the type. **Raises:** If there is an error when computing the string representation of the type. --- ## StringLiteral `@register_passable(trivial)` `struct StringLiteral[value: string]` This type represents a string literal. String literals are all null-terminated for compatibility with C APIs, but this is subject to change. String literals store their length as an integer, and this does not include the null terminator. ## Parameters * ​value (`string`): The underlying string value. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `ExplicitlyCopyable`, `FloatableRaising`, `IntableRaising`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable` ## Methods ### `__init__` `__init__() -> Self` Constructor for any value. ### `__bool__` `__bool__(self) -> Bool` Convert the string to a bool value. **Returns:** True if the string is not empty. ### `__getitem__` `__getitem__[IndexerType: Indexer](self, idx: IndexerType) -> String` Gets the character at the specified position. **Parameters:** * ​IndexerType (`Indexer`): The inferred type of an indexer argument. **Args:** * ​idx (`IndexerType`): The index value. **Returns:** A new string containing the character at the specified position. ### `__lt__` `__lt__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using lesser than (LT) comparison. 
**Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is strictly less than the RHS and False otherwise. ### `__le__` `__le__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using lesser than or equal to (LE) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is less than or equal to the RHS and False otherwise. ### `__eq__` `__eq__(self, rhs: StringSlice[origin]) -> Bool` Compare two string literals for equality. **Args:** * ​rhs (`StringSlice[origin]`): The string to compare. **Returns:** True if they are equal. ### `__ne__` `__ne__(self, rhs: StringSlice[origin]) -> Bool` Compare two string literals for inequality. **Args:** * ​rhs (`StringSlice[origin]`): The string to compare. **Returns:** True if they are not equal. ### `__gt__` `__gt__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using greater than (GT) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is strictly greater than the RHS and False otherwise. ### `__ge__` `__ge__(self, rhs: StringSlice[origin]) -> Bool` Compare this value to the RHS using greater than or equal to (GE) comparison. **Args:** * ​rhs (`StringSlice[origin]`): The other value to compare against. **Returns:** True if this is greater than or equal to the RHS and False otherwise. ### `__add__` `__add__(self, rhs: StringLiteral[value]) -> StringLiteral[#pop.string_concat]` Concatenate two string literals. **Args:** * ​rhs (`StringLiteral[value]`): The string to concatenate. **Returns:** The concatenated string. ### `__mul__` `__mul__(self, n: Int) -> String` Concatenates the string `n` times. **Args:** * ​n (`Int`): The number of times to concatenate the string. **Returns:** The string concatenated `n` times. ### `copy` `copy(self) -> Self` Copy constructor. **Returns:** A copy of the value. ### `to_python_object` `to_python_object(self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__len__` `__len__(self) -> Int` Get the string length. **Returns:** The length of this value. ### `__int__` `__int__(self) -> Int` Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised. **Returns:** An integer value that represents the string, or otherwise raises. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Parses the string as a floating-point number and returns that value. If the string cannot be parsed as a float, an error is raised. **Returns:** A float value that represents the string, or otherwise raises. ### `__str__` `__str__(self) -> String` Convert the string literal to a string. **Returns:** A new string. ### `__repr__` `__repr__(self) -> String` Return a representation of this value. You don't need to call this method directly, use `repr("...")` instead. **Returns:** A new representation of the string. ### `__fspath__` `__fspath__(self) -> String` Return the file system path representation of the object. **Returns:** The file system path representation as a string. ### `__iter__` `__iter__(self) -> CodepointSliceIter[StaticConstantOrigin]` Return an iterator over the string literal. **Returns:** An iterator over the string. ### `__reversed__` `__reversed__(self) -> CodepointSliceIter[StaticConstantOrigin, False]` Iterate backwards over the string, returning immutable references.
**Returns:** A reversed iterator over the string. ### `__merge_with__` `__merge_with__[: string, //, other_type: AnyStruct[StringLiteral[$0]]](self) -> StringSlice[StaticConstantOrigin]` Returns a StaticString after merging with another string literal. **Parameters:** * ​other\_type (`AnyStruct[StringLiteral[$0]]`): The type of the string literal to merge with. **Returns:** A StaticString after merging with the specified `other_type`. ### `byte_length` `byte_length(self) -> Int` Get the string length in bytes. Notes: This does not include the trailing null terminator in the count. **Returns:** The length of this string in bytes. ### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=False, origin=StaticConstantOrigin]` Get raw pointer to the underlying data. **Returns:** The raw pointer to the data. ### `unsafe_cstr_ptr` `unsafe_cstr_ptr(self) -> UnsafePointer[SIMD[int8, 1], mut=False, origin=StaticConstantOrigin]` Retrieves a C-string-compatible pointer to the underlying memory. The returned pointer is guaranteed to be NUL terminated, and not null. **Returns:** The pointer to the underlying memory. ### `as_string_slice` `as_string_slice(self) -> StringSlice[StaticConstantOrigin]` Returns a string slice of this static string literal. **Returns:** A string slice pointing to this static string literal. ### `as_bytes` `as_bytes(self) -> Span[SIMD[uint8, 1], StaticConstantOrigin]` Returns a contiguous Span of the bytes owned by this string. **Returns:** A contiguous slice pointing to the bytes owned by this string. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this string literal to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writable trait. **Args:** * ​writer (`W`): The object to write to. ### `find` `find(self, substr: StringSlice[StaticConstantOrigin], start: Int = 0) -> Int` Finds the offset of the first occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[StaticConstantOrigin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string. ### `rfind` `rfind(self, substr: StringSlice[StaticConstantOrigin], start: Int = 0) -> Int` Finds the offset of the last occurrence of `substr` starting at `start`. If not found, returns -1. **Args:** * ​substr (`StringSlice[StaticConstantOrigin]`): The substring to find. * ​start (`Int`): The offset from which to find. **Returns:** The offset of `substr` relative to the beginning of the string. ### `count` `count(self, substr: StringSlice[origin]) -> Int` Return the number of non-overlapping occurrences of substring `substr` in the string literal. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one. **Args:** * ​substr (`StringSlice[origin]`): The substring to count. **Returns:** The number of occurrences of `substr`. ### `lower` `lower(self) -> String` Returns a copy of the string literal with all cased characters converted to lowercase. **Returns:** A new string where cased letters have been converted to lowercase. ### `upper` `upper(self) -> String` Returns a copy of the string literal with all cased characters converted to uppercase. **Returns:** A new string where cased letters have been converted to uppercase.
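For example, here's a minimal sketch combining the search and case-conversion methods above (the values in the output comments are assumptions based on the documented behavior):

```mojo
fn main():
    var lit = "Hello, Mojo"
    # Search methods return byte offsets or occurrence counts.
    print(lit.find("Mojo"))  # 7
    print(lit.count("l"))    # 2
    # Case conversion returns a new String.
    print(lit.lower())       # hello, mojo
    print(lit.upper())       # HELLO, MOJO
```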
### `rjust` `rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string right justified in a string literal of the specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The right-justified string, or `self` if `width` is not bigger than the length of `self`. ### `ljust` `ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string left justified in a string literal of the specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The left-justified string, or `self` if `width` is not bigger than the length of `self`. ### `center` `center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string center justified in a string literal of the specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The center-justified string, or `self` if `width` is not bigger than the length of `self`. ### `startswith` `startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string literal starts with the specified prefix between start and end positions. Returns True if found and False otherwise. **Args:** * ​prefix (`StringSlice[origin]`): The prefix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if `self[start:end]` is prefixed by the input prefix. ### `endswith` `endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Checks if the string literal ends with the specified suffix between start and end positions. Returns True if found and False otherwise. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to check. * ​start (`Int`): The start offset from which to check. * ​end (`Int`): The end offset from which to check. **Returns:** True if `self[start:end]` is suffixed by the input suffix. ### `isdigit` `isdigit(self) -> Bool` Returns True if all characters in the string literal are digits. Note that this currently only works with ASCII strings. **Returns:** True if all characters are digits else False. ### `isupper` `isupper(self) -> Bool` Returns True if all cased characters in the string literal are uppercase and there is at least one cased character. Note that this currently only works with ASCII strings. **Returns:** True if all cased characters in the string literal are uppercase and there is at least one cased character, False otherwise. ### `islower` `islower(self) -> Bool` Returns True if all cased characters in the string literal are lowercase and there is at least one cased character. Note that this currently only works with ASCII strings. **Returns:** True if all cased characters in the string literal are lowercase and there is at least one cased character, False otherwise. ### `strip` `strip(self) -> String` Return a copy of the string literal with leading and trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A string with no leading or trailing whitespaces.
`strip(self, chars: StringSlice[origin]) -> String` Return a copy of the string literal with leading and trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A string with no leading or trailing characters. ### `rstrip` `rstrip(self, chars: StringSlice[origin]) -> String` Return a copy of the string literal with trailing characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A string with no trailing characters. `rstrip(self) -> String` Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no trailing whitespaces. ### `lstrip` `lstrip(self, chars: StringSlice[origin]) -> String` Return a copy of the string with leading characters removed. **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading characters. `lstrip(self) -> String` Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. **Returns:** A copy of the string with no leading whitespaces. --- ## StringSlice `@register_passable(trivial)` `struct StringSlice[mut: Bool, //, origin: Origin[mut]]` A non-owning view to encoded string data. This type is guaranteed to have the same ABI (size, alignment, and field layout) as the `llvm::StringRef` type. See the [`string_slice` module](/mojo/stdlib/collections/string/string_slice/) for more information and examples. Notes: The underlying string data is guaranteed to be encoded using UTF-8. ## Parameters * ​mut (`Bool`): Whether the slice is mutable. * ​origin (`Origin[mut]`): The origin of the underlying string data. ## Implemented traits `AnyType`, `Boolable`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `FloatableRaising`, `Hashable`, `IntableRaising`, `KeyElement`, `Movable`, `PathLike`, `PythonConvertible`, `Representable`, `Sized`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher` ## Aliases ### `Immutable` `alias Immutable = StringSlice[(muttoimm origin._mlir_origin)]` The immutable version of the `StringSlice`. ### `Mutable` `alias Mutable = StringSlice[(mutcast origin._mlir_origin)]` The mutable version of the `StringSlice`. ## Methods ### `__init__` `__init__() -> Self` Create an empty / zero-length slice. `@implicit` `__init__(lit: StringLiteral[value]) -> StringSlice[StaticConstantOrigin]` Construct a new `StringSlice` from a `StringLiteral`. **Args:** * ​lit (`StringLiteral[value]`): The literal to construct this `StringSlice` from. `__init__(*, unsafe_from_utf8: Span[SIMD[uint8, 1], origin]) -> Self` Construct a new `StringSlice` from a sequence of UTF-8 encoded bytes. Safety: `unsafe_from_utf8` MUST be valid UTF-8 encoded data. **Args:** * ​unsafe\_from\_utf8 (`Span[SIMD[uint8, 1], origin]`): A `Span[Byte]` encoded in UTF-8. `__init__(*, unsafe_from_utf8_ptr: UnsafePointer[SIMD[uint8, 1]]) -> Self` Construct a new StringSlice from an `UnsafePointer[Byte]` pointing to null-terminated UTF-8 encoded bytes. Safety: * `unsafe_from_utf8_ptr` MUST point to data that is valid for `origin`. * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated.
**Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[uint8, 1]]`): An `UnsafePointer[Byte]` of null-terminated bytes encoded in UTF-8. `__init__(*, unsafe_from_utf8_ptr: UnsafePointer[SIMD[int8, 1]]) -> Self` Construct a new StringSlice from a `UnsafePointer[c_char]` pointing to null-terminated UTF-8 encoded bytes. Safety: * `unsafe_from_utf8_ptr` MUST be valid UTF-8 encoded data. * `unsafe_from_utf8_ptr` MUST be null terminated. **Args:** * ​unsafe\_from\_utf8\_ptr (`UnsafePointer[SIMD[int8, 1]]`): An `UnsafePointer[c_char]` of null-terminated bytes encoded in UTF-8. `__init__(*, ptr: UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin], length: UInt) -> Self` Construct a `StringSlice` from a pointer to a sequence of UTF-8 encoded bytes and a length. Safety: * `ptr` MUST point to at least `length` bytes of valid UTF-8 encoded data. * `ptr` must point to data that is live for the duration of `origin`. **Args:** * ​ptr (`UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin]`): A pointer to a sequence of bytes encoded in UTF-8. * ​length (`UInt`): The number of bytes of encoded data. `@implicit` `__init__[origin: ImmutableOrigin, //](ref [origin] value: String) -> StringSlice[origin]` Construct an immutable StringSlice. **Parameters:** * ​origin (`ImmutableOrigin`): The immutable origin. **Args:** * ​value (`String`): The string value. ### `__bool__` `__bool__(self) -> Bool` Check if a string slice is non-empty. **Returns:** True if a string slice is non-empty, False otherwise. ### `__getitem__` `__getitem__(self, span: Slice) -> Self` Gets the sequence of characters at the specified positions. Raises: This function will raise if the specified slice start or end position are outside the bounds of the string, or if they do not both fall on codepoint boundaries. **Args:** * ​span (`Slice`): A slice that specifies positions of the new substring. **Returns:** A new StringSlice containing the substring at the specified positions. `__getitem__[I: Indexer](self, idx: I) -> String` Gets the character at the specified position. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The index value. **Returns:** A new string containing the character at the specified position. ### `__lt__` `__lt__(self, rhs: StringSlice[origin]) -> Bool` Verify if the `StringSlice` bytes are strictly less than the input in overlapping content. **Args:** * ​rhs (`StringSlice[origin]`): The other `StringSlice` to compare against. **Returns:** If the `StringSlice` bytes are strictly less than the input in overlapping content. ### `__eq__` `__eq__(self, rhs_same: Self) -> Bool` Verify if a `StringSlice` is equal to another `StringSlice` with the same origin. **Args:** * ​rhs\_same (`Self`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is equal to the input in length and contents. `__eq__(self, rhs: StringSlice[origin]) -> Bool` Verify if a `StringSlice` is equal to another `StringSlice`. **Args:** * ​rhs (`StringSlice[origin]`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is equal to the input in length and contents. ### `__ne__` `__ne__(self, rhs_same: Self) -> Bool` Verify if a `StringSlice` is not equal to another `StringSlice` with the same origin. **Args:** * ​rhs\_same (`Self`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is not equal to the input in length and contents. `__ne__(self, rhs: StringSlice[origin]) -> Bool` Verify if span is not equal to another `StringSlice`. 
**Args:** * ​rhs (`StringSlice[origin]`): The `StringSlice` to compare against. **Returns:** If the `StringSlice` is not equal to the input in length and contents. ### `__contains__` `__contains__(self, substr: StringSlice[origin]) -> Bool` Returns True if the substring is contained within the current string. **Args:** * ​substr (`StringSlice[origin]`): The substring to check. **Returns:** True if the string contains the substring. ### `__add__` `__add__(self, rhs: StringSlice[origin]) -> String` Returns a string with this value prefixed on another string. **Args:** * ​rhs (`StringSlice[origin]`): The right side of the result. **Returns:** The result string. ### `__mul__` `__mul__(self, n: Int) -> String` Concatenates the string `n` times. **Args:** * ​n (`Int`): The number of times to concatenate the string. **Returns:** The string concatenated `n` times. ### `__radd__` `__radd__(self, lhs: StringSlice[origin]) -> String` Returns a string with this value appended to another string. **Args:** * ​lhs (`StringSlice[origin]`): The left side of the result. **Returns:** The result string. ### `copy` `copy(self) -> Self` Explicitly construct a deep copy of the provided `StringSlice`. **Returns:** A copy of the value. ### `from_utf8` `static from_utf8(from_utf8: Span[SIMD[uint8, 1], origin]) -> Self` Construct a new `StringSlice` from a buffer containing UTF-8 encoded data. **Args:** * ​from\_utf8 (`Span[SIMD[uint8, 1], origin]`): A span of bytes containing UTF-8 encoded data. **Returns:** A new validated `StringSlice` pointing to the provided buffer. **Raises:** An exception is raised if the provided buffer byte values do not form valid UTF-8 encoded codepoints. ### `__str__` `__str__(self) -> String` Convert this StringSlice to a String. Notes: This will allocate a new string that copies the string contents from the provided string slice. **Returns:** A new String. ### `__repr__` `__repr__(self) -> String` Return a Mojo-compatible representation of this string slice. **Returns:** Representation of this string slice using Mojo string literal input form syntax. ### `__len__` `__len__(self) -> Int` Get the string length in bytes. This function returns the number of bytes in the underlying UTF-8 representation of the string. To get the number of Unicode codepoints in a string, use `len(str.codepoints())`. # Examples Query the length of a string, in bytes and Unicode codepoints: ```mojo from testing import assert_equal var s = StringSlice("ನಮಸ್ಕಾರ") assert_equal(len(s), 21) assert_equal(len(s.codepoints()), 7) ``` Strings containing only ASCII characters have the same byte and Unicode codepoint length: ```mojo from testing import assert_equal var s = StringSlice("abc") assert_equal(len(s), 3) assert_equal(len(s.codepoints()), 3) ``` **Returns:** The string length in bytes. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this string slice to the provided `Writer`. **Parameters:** * ​W (`Writer`): A type conforming to the `Writable` trait. **Args:** * ​writer (`W`): The object to write to. ### `__hash__` `__hash__(self) -> UInt` Hash the underlying buffer using builtin hash. **Returns:** A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details. `__hash__[H: _Hasher](self, mut hasher: H)` Updates hasher with the underlying bytes. **Parameters:** * ​H (`_Hasher`): The hasher type. **Args:** * ​hasher (`H`): The hasher instance.
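For example, a minimal sketch exercising the containment, concatenation, and length methods above (the output comments are assumptions):

```mojo
fn main():
    var s = StringSlice("hello world")
    # `in` invokes `__contains__()`.
    print("world" in s)  # True
    # `+` invokes `__add__()` and returns a new String.
    print(s + "!")       # hello world!
    # `len()` invokes `__len__()` and counts bytes.
    print(len(s))        # 11
```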
### `__fspath__` `__fspath__(self) -> String` Return the file system path representation of this string. **Returns:** The file system path representation as a string. ### `to_python_object` `to_python_object(self) -> PythonObject` Convert this value to a PythonObject. **Returns:** A PythonObject representing the value. ### `__iter__` `__iter__(self) -> CodepointSliceIter[origin]` Iterate over the string, returning immutable references. **Returns:** An iterator of references to the string elements. ### `__reversed__` `__reversed__(self) -> CodepointSliceIter[origin, False]` Iterate backwards over the string, returning immutable references. **Returns:** A reversed iterator of references to the string elements. ### `__int__` `__int__(self) -> Int` Parses the given string as a base-10 integer and returns that value. If the string cannot be parsed as an int, an error is raised. **Returns:** An integer value that represents the string, or otherwise raises. ### `__float__` `__float__(self) -> SIMD[float64, 1]` Parses the string as a floating-point number and returns that value. If the string cannot be parsed as a float, an error is raised. **Returns:** A float value that represents the string, or otherwise raises. ### `__merge_with__` `__merge_with__[: Bool, : Origin[$0], //, other_type: AnyStruct[StringSlice[$1]]](self) -> StringSlice[origin]` Returns a string slice with merged origins. **Parameters:** * ​other\_type (`AnyStruct[StringSlice[$1]]`): The type of the origin to merge with. **Returns:** A StringSlice merged with the other origin. ### `get_immutable` `get_immutable(self) -> StringSlice[(muttoimm origin._mlir_origin)]` Return an immutable version of this string slice. **Returns:** An immutable version of the same string slice. ### `replace` `replace(self, old: StringSlice[origin], new: StringSlice[origin]) -> String` Return a copy of the string with all occurrences of substring `old` replaced by `new`. **Args:** * ​old (`StringSlice[origin]`): The substring to replace. * ​new (`StringSlice[origin]`): The substring to replace with. **Returns:** The string where all occurrences of `old` are replaced with `new`. ### `split` `split(self, sep: StringSlice[origin], maxsplit: Int = -1) -> List[StringSlice[(muttoimm origin._mlir_origin)]]` Split the string by a separator. Examples: ```mojo # Splitting on a space _ = StringSlice("hello world").split(" ") # ["hello", "world"] # Splitting adjacent separators _ = StringSlice("hello,,world").split(",") # ["hello", "", "world"] # Splitting with maxsplit _ = StringSlice("1,2,3").split(",", 1) # ['1', '2,3'] ``` **Args:** * ​sep (`StringSlice[origin]`): The string to split on. * ​maxsplit (`Int`): The maximum number of items to split from the string. Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. **Raises:** If the separator is empty. `split(self, sep: NoneType = NoneType(None), maxsplit: Int = -1) -> List[StringSlice[(muttoimm origin._mlir_origin)]]` Split the string by every whitespace separator. Examples: ```mojo # Splitting an empty string or filled with whitespaces _ = StringSlice(" ").split() # [] _ = StringSlice("").split() # [] # Splitting a string with leading, trailing, and middle whitespaces _ = StringSlice(" hello world ").split() # ["hello", "world"] # Splitting adjacent universal newlines: _ = StringSlice( "hello \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029world" ).split() # ["hello", "world"] ``` **Args:** * ​sep (`NoneType`): None. * ​maxsplit (`Int`): The maximum number of items to split from the string.
Defaults to unlimited. **Returns:** A List of Strings containing the input split by the separator. ### `strip` `strip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with leading and trailing characters removed. Example: ```mojo print("himojohi".strip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading or trailing characters. `strip(self) -> Self` Return a copy of the string with leading and trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print(" mojo ".strip()) # "mojo" ``` **Returns:** A copy of the string with no leading or trailing whitespaces. ### `rstrip` `rstrip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with trailing characters removed. Example: ```mojo print("mojohi".rstrip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no trailing characters. `rstrip(self) -> Self` Return a copy of the string with trailing whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print("mojo ".rstrip()) # "mojo" ``` **Returns:** A copy of the string with no trailing whitespaces. ### `lstrip` `lstrip(self, chars: StringSlice[origin]) -> Self` Return a copy of the string with leading characters removed. Example: ```mojo print("himojo".lstrip("hi")) # "mojo" ``` **Args:** * ​chars (`StringSlice[origin]`): A set of characters to be removed. Defaults to whitespace. **Returns:** A copy of the string with no leading characters. `lstrip(self) -> Self` Return a copy of the string with leading whitespaces removed. This only takes ASCII whitespace into account: `" \t\n\v\f\r\x1c\x1d\x1e"`. Example: ```mojo print(" mojo".lstrip()) # "mojo" ``` **Returns:** A copy of the string with no leading whitespaces. ### `codepoints` `codepoints(self) -> CodepointsIter[origin]` Returns an iterator over the `Codepoint`s encoded in this string slice. # Examples Print the characters in a string: ```mojo from testing import assert_equal var s = StringSlice("abc") var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) assert_equal(iter.__next__(), Codepoint.ord("b")) assert_equal(iter.__next__(), Codepoint.ord("c")) assert_equal(iter.__has_next__(), False) ``` `codepoints()` iterates over Unicode codepoints, and supports multibyte codepoints: ```mojo from testing import assert_equal # A visual character composed of a combining sequence of 2 codepoints. var s = StringSlice("á") assert_equal(s.byte_length(), 3) var iter = s.codepoints() assert_equal(iter.__next__(), Codepoint.ord("a")) # U+0301 Combining Acute Accent assert_equal(iter.__next__().to_u32(), 0x0301) assert_equal(iter.__has_next__(), False) ``` **Returns:** An iterator type that returns successive `Codepoint` values stored in this string slice. ### `codepoint_slices` `codepoint_slices(self) -> CodepointSliceIter[origin]` Iterate over the string, returning immutable references. **Returns:** An iterator of references to the string elements. ### `as_bytes` `as_bytes(self) -> Span[SIMD[uint8, 1], origin]` Get the sequence of encoded bytes of the underlying string. **Returns:** A slice containing the underlying sequence of encoded bytes.
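Putting the `strip()` and `split()` methods above together gives a small tokenization sketch, in the spirit of the module's stated parsing use case (the output comments are assumptions):

```mojo
fn main() raises:
    # `split(",")` can raise (for an empty separator), so `main` is `raises`.
    var line = StringSlice("  alpha,beta,gamma  ")
    var fields = line.strip().split(",")
    print(len(fields))  # 3
    print(fields[0])    # alpha
    print(fields[2])    # gamma
```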
### `unsafe_ptr` `unsafe_ptr(self) -> UnsafePointer[SIMD[uint8, 1], mut=mut, origin=origin]` Gets a pointer to the first element of this string slice. **Returns:** A pointer pointing at the first element of this string slice. ### `byte_length` `byte_length(self) -> Int` Get the length of this string slice in bytes. **Returns:** The length of this string slice in bytes. ### `char_length` `char_length(self) -> UInt` Returns the length in Unicode codepoints. This returns the number of `Codepoint` codepoint values encoded in the UTF-8 representation of this string. Note: To get the length in bytes, use `StringSlice.byte_length()`. # Examples Query the length of a string, in bytes and Unicode codepoints: ```mojo from testing import assert_equal var s = StringSlice("ನಮಸ್ಕಾರ") assert_equal(s.char_length(), 7) assert_equal(len(s), 21) ``` Strings containing only ASCII characters have the same byte and Unicode codepoint length: ```mojo from testing import assert_equal var s = StringSlice("abc") assert_equal(s.char_length(), 3) assert_equal(len(s), 3) ``` The character length of a string with visual combining characters is the length in Unicode codepoints, not grapheme clusters: ```mojo from testing import assert_equal var s = StringSlice("á") assert_equal(s.char_length(), 2) assert_equal(s.byte_length(), 3) ``` **Returns:** The length in Unicode codepoints. ### `is_codepoint_boundary` `is_codepoint_boundary(self, index: UInt) -> Bool` Returns True if `index` is the position of the first byte in a UTF-8 codepoint sequence, or is at the end of the string. A byte position is considered a codepoint boundary if a valid subslice of the string would end (noninclusive) at `index`. Positions `0` and `len(self)` are considered to be codepoint boundaries. Positions beyond the length of the string slice will return False. Examples: Check if particular byte positions are codepoint boundaries: ```mojo from testing import assert_equal, assert_true, assert_false var abc = StringSlice("abc") assert_equal(len(abc), 3) assert_true(abc.is_codepoint_boundary(0)) assert_true(abc.is_codepoint_boundary(1)) assert_true(abc.is_codepoint_boundary(2)) assert_true(abc.is_codepoint_boundary(3)) ``` Only the index of the first byte in a multi-byte codepoint sequence is considered a codepoint boundary: ```mojo var thumb = StringSlice("👍") assert_equal(len(thumb), 4) assert_true(thumb.is_codepoint_boundary(0)) assert_false(thumb.is_codepoint_boundary(1)) assert_false(thumb.is_codepoint_boundary(2)) assert_false(thumb.is_codepoint_boundary(3)) ``` Visualization showing which bytes are considered codepoint boundaries, within a piece of text that includes codepoints whose UTF-8 representation requires, respectively, 1, 2, 3, and 4-bytes. The codepoint boundary byte indices are indicated by a vertical arrow (↑). For example, this diagram shows that a slice of bytes formed by the half-open range starting at byte 3 and extending up to but not including byte 6 (`[3, 6)`) is a valid UTF-8 sequence.
```text ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ a©➇𝄞 ┃ String ┣━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┫ ┃97┃ 169 ┃ 10119 ┃ 119070 ┃ Unicode Codepoints ┣━━╋━━━┳━━━╋━━━┳━━━┳━━━╋━━━┳━━━┳━━━┳━━━┫ ┃97┃194┃169┃226┃158┃135┃240┃157┃132┃158┃ UTF-8 Bytes ┗━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┻━━━┛ 0 1 2 3 4 5 6 7 8 9 10 ↑ ↑ ↑ ↑ ↑ ``` The following program verifies the above diagram: ```mojo from testing import assert_true, assert_false var text = StringSlice("a©➇𝄞") assert_true(text.is_codepoint_boundary(0)) assert_true(text.is_codepoint_boundary(1)) assert_false(text.is_codepoint_boundary(2)) assert_true(text.is_codepoint_boundary(3)) assert_false(text.is_codepoint_boundary(4)) assert_false(text.is_codepoint_boundary(5)) assert_true(text.is_codepoint_boundary(6)) assert_false(text.is_codepoint_boundary(7)) assert_false(text.is_codepoint_boundary(8)) assert_false(text.is_codepoint_boundary(9)) assert_true(text.is_codepoint_boundary(10)) ``` **Args:** * ​index (`UInt`): An index into the underlying byte representation of the string. **Returns:** A boolean indicating if `index` gives the position of the first byte in a UTF-8 codepoint sequence, or is at the end of the string. ### `startswith` `startswith(self, prefix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Verify if the `StringSlice` starts with the specified prefix between start and end positions. The `start` and `end` positions must be offsets given in bytes, and must be codepoint boundaries. **Args:** * ​prefix (`StringSlice[origin]`): The prefix to check. * ​start (`Int`): The start offset in bytes from which to check. * ​end (`Int`): The end offset in bytes from which to check. **Returns:** True if `self[start:end]` is prefixed by the input prefix. ### `endswith` `endswith(self, suffix: StringSlice[origin], start: Int = 0, end: Int = -1) -> Bool` Verify if the `StringSlice` ends with the specified suffix between start and end positions. The `start` and `end` positions must be offsets given in bytes, and must be codepoint boundaries. **Args:** * ​suffix (`StringSlice[origin]`): The suffix to check. * ​start (`Int`): The start offset in bytes from which to check. * ​end (`Int`): The end offset in bytes from which to check. **Returns:** True if `self[start:end]` is suffixed by the input suffix. ### `removeprefix` `removeprefix(self, prefix: StringSlice[origin], /) -> Self` Returns a new string with the prefix removed if it was present. Examples: ```mojo print(StringSlice('TestHook').removeprefix('Test')) # 'Hook' print(StringSlice('BaseTestCase').removeprefix('Test')) # 'BaseTestCase' ``` **Args:** * ​prefix (`StringSlice[origin]`): The prefix to remove from the string. **Returns:** `string[len(prefix):]` if the string starts with the prefix string, or a copy of the original string otherwise. ### `removesuffix` `removesuffix(self, suffix: StringSlice[origin], /) -> Self` Returns a new string with the suffix removed if it was present. Examples: ```mojo print(StringSlice('TestHook').removesuffix('Hook')) # 'Test' print(StringSlice('BaseTestCase').removesuffix('Test')) # 'BaseTestCase' ``` **Args:** * ​suffix (`StringSlice[origin]`): The suffix to remove from the string. **Returns:** `string[:-len(suffix)]` if the string ends with the suffix string, or a copy of the original string otherwise. ### `format` `format[*Ts: Stringable & Representable](self, *args: *Ts) -> String` Produce a formatted string using the current string as a template.
The template, or "format string", can contain literal text and/or replacement fields delimited with curly braces (`{}`). Returns a copy of the format string with the replacement fields replaced with string representations of the `args` arguments. For more information, see the discussion in the [`format` module](/mojo/stdlib/collections/string/format/). Examples: ```mojo # Manual indexing: print(StringSlice("{0} {1} {0}").format("Mojo", 1.125)) # Mojo 1.125 Mojo # Automatic indexing: print(StringSlice("{} {}").format(True, "hello world")) # True hello world ``` **Parameters:** * ​\*Ts (`Stringable & Representable`): The types of substitution values that implement `Representable` and `Stringable` (to be changed and made more flexible). **Args:** * ​\*args (`*Ts`): The substitution values. **Returns:** The template with the given values substituted. ### `find` `find(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset in bytes of the first occurrence of `substr` starting at `start`. If not found, returns `-1`. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset in bytes from which to find. Must be a codepoint boundary. **Returns:** The offset in bytes of `substr` relative to the beginning of the string. ### `rfind` `rfind(self, substr: StringSlice[origin], start: Int = 0) -> Int` Finds the offset in bytes of the last occurrence of `substr` starting at `start`. If not found, returns `-1`. **Args:** * ​substr (`StringSlice[origin]`): The substring to find. * ​start (`Int`): The offset in bytes from which to find. Must be a valid codepoint boundary. **Returns:** The offset in bytes of `substr` relative to the beginning of the string. ### `isspace` `isspace(self) -> Bool` Determines whether every character in the given StringSlice is a Python whitespace string. This corresponds to Python's [universal separators](https://docs.python.org/3/library/stdtypes.html#str.splitlines): `" \t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. Example: Check if a string contains only whitespace: ```mojo from testing import assert_true, assert_false # An empty string is not considered to contain only whitespace chars: assert_false(StringSlice("").isspace()) # ASCII space characters assert_true(StringSlice(" ").isspace()) assert_true(StringSlice(" ").isspace()) # Contains non-space characters assert_false(StringSlice(" abc ").isspace()) ``` **Returns:** True if the whole StringSlice is made up of whitespace characters listed above, otherwise False. ### `isnewline` `isnewline[single_character: Bool = False](self) -> Bool` Determines whether every character in the given StringSlice is a Python newline character. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Parameters:** * ​single\_character (`Bool`): Whether to evaluate the string slice as a single Unicode character (avoids overhead when already iterating). **Returns:** True if the whole StringSlice is made up of newline characters listed above, otherwise False. ### `splitlines` `splitlines[O: ImmutableOrigin, //](self: StringSlice[O], keepends: Bool = False) -> List[StringSlice[O]]` Split the string at line boundaries. This corresponds to Python's [universal newlines:](https://docs.python.org/3/library/stdtypes.html#str.splitlines) `"\r\n"` and `"\t\n\v\f\r\x1c\x1d\x1e\x85\u2028\u2029"`. **Parameters:** * ​O (`ImmutableOrigin`): The immutable origin.
**Args:** * ​keepends (`Bool`): If True, line breaks are kept in the resulting strings. **Returns:** A List of Strings containing the input split by line boundaries. ### `count` `count(self, substr: StringSlice[origin]) -> Int` Return the number of non-overlapping occurrences of substring `substr` in the string. If `substr` is empty, returns the number of empty strings between characters, which is the length of the string plus one. **Args:** * ​substr (`StringSlice[origin]`): The substring to count. **Returns:** The number of occurrences of `substr`. ### `is_ascii_digit` `is_ascii_digit(self) -> Bool` A string is a digit string if all characters in the string are digits and there is at least one character in the string. Note that this currently only works with ASCII strings. **Returns:** True if all characters are digits and it's not empty else False. ### `isupper` `isupper(self) -> Bool` Returns True if all cased characters in the string are uppercase and there is at least one cased character. **Returns:** True if all cased characters in the string are uppercase and there is at least one cased character, False otherwise. ### `islower` `islower(self) -> Bool` Returns True if all cased characters in the string are lowercase and there is at least one cased character. **Returns:** True if all cased characters in the string are lowercase and there is at least one cased character, False otherwise. ### `lower` `lower(self) -> String` Returns a copy of the string with all cased characters converted to lowercase. **Returns:** A new string where cased letters have been converted to lowercase. ### `upper` `upper(self) -> String` Returns a copy of the string with all cased characters converted to uppercase. **Returns:** A new string where cased letters have been converted to uppercase. ### `is_ascii_printable` `is_ascii_printable(self) -> Bool` Returns True if all characters in the string are ASCII printable. Note that this currently only works with ASCII strings. **Returns:** True if all characters are printable else False. ### `rjust` `rjust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string right justified in a string of the specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The right-justified string, or `self` if `width` is not bigger than the length of `self`. ### `ljust` `ljust(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string left justified in a string of the specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The left-justified string, or `self` if `width` is not bigger than the length of `self`. ### `center` `center(self, width: Int, fillchar: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string](" ")) -> String` Returns the string center justified in a string of the specified width. **Args:** * ​width (`Int`): The width of the field containing the string. * ​fillchar (`StringSlice[StaticConstantOrigin]`): Specifies the padding character. **Returns:** The center-justified string, or `self` if `width` is not bigger than the length of `self`.
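For example, a quick sketch of the justification methods above (the output comments are assumptions):

```mojo
fn main():
    var s = StringSlice("mojo")
    # Pad to width 8 with "*" as the fill character.
    print(s.rjust(8, "*"))   # ****mojo
    print(s.ljust(8, "*"))   # mojo****
    print(s.center(8, "*"))  # **mojo**
```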
### `join` `join[T: Copyable & Movable & Writable](self, elems: List[T, hint_trivial_type]) -> String` Joins string elements using the current string as a delimiter. **Parameters:** * ​T (`Copyable & Movable & Writable`): The type of the elements, must implement the `Copyable`, `Movable` and `Writable` traits. **Args:** * ​elems (`List[T, hint_trivial_type]`): The input values. **Returns:** The joined string. `join[*Ts: Writable](self: StringSlice[StaticConstantOrigin], *elems: *Ts) -> String` Joins string elements using the current string as a delimiter. **Parameters:** * ​\*Ts (`Writable`): The types of the elements. **Args:** * ​\*elems (`*Ts`): The input values. **Returns:** The joined string. --- ## Structs A Mojo struct is a data structure that allows you to encapsulate fields and methods that operate on an abstraction, such as a data type or an object. **Fields** are variables that hold data relevant to the struct, and **methods** are functions inside a struct that generally act upon the field data. For example, if you're building a graphics program, you can use a struct to define an `Image` that has fields to store information about each image (such as the pixels) and methods that perform actions on it (such as rotate it). For the most part, Mojo's struct format is designed to provide a static, memory-safe data structure for high-level data types used in programs. For example, all the data types in Mojo's standard library (such as `Int`, `Bool`, `String`, and `Tuple`) are defined as structs. If you understand how [functions](/mojo/manual/functions) and [variables](/mojo/manual/variables) work in Mojo, you probably noticed that Mojo is designed to provide dynamic programming features in a `def` function while enforcing stronger code safety in `fn` functions. When it comes to structs, Mojo leans toward the safe side: You can still choose whether to use either `def` or `fn` declarations for methods, but all fields must be declared with `var`. ## Struct definition You can define a simple struct called `MyPair` with two fields like this: ```mojo struct MyPair: var first: Int var second: Int ``` However, you can't instantiate this struct because it has no constructor method. So here it is with a constructor to initialize the two fields: ```mojo struct MyPair: var first: Int var second: Int fn __init__(out self, first: Int, second: Int): self.first = first self.second = second ``` Notice that the first argument in the `__init__()` method is `out self`. You'll have a `self` argument as the first argument on all struct methods. It references the current struct instance (it allows code in the method to refer to "itself"). *When you call the constructor, you never pass a value for `self`—Mojo passes it in automatically.* The `out` portion of `out self` is an [argument convention](/mojo/manual/values/ownership#argument-conventions) that declares `self` as a mutable reference that starts out as uninitialized and must be initialized before the function returns. The `__init__()` method is one of many [special methods](#special-methods) (also known as "dunder methods" because they have *d*ouble *under*scores) with pre-determined names. :::note You can't assign values when you declare fields. You must initialize all of the struct's fields in the constructor. (If you try to leave a field uninitialized, the code won't compile.) 
::: Once you have a constructor, you can create an instance of `MyPair` and set the fields: ```mojo var mine = MyPair(2,4) print(mine.first) ``` ```output 2 ``` ## Methods In addition to special methods like `__init__()`, you can add any other method you want to your struct. For example: ```mojo struct MyPair: var first: Int var second: Int fn __init__(out self, first: Int, second: Int): self.first = first self.second = second fn get_sum(self) -> Int: return self.first + self.second ``` ```mojo var mine = MyPair(6, 8) print(mine.get_sum()) ``` ```output 14 ``` Notice that `get_sum()` also uses the `self` argument, because this is the only way you can access the struct's fields in a method. The name `self` is just a convention, and you can use any name you want to refer to the struct instance that is always passed as the first argument. Methods that take the implicit `self` argument are called *instance methods* because they act on an instance of the struct. :::note The `self` argument in a struct method is the only argument in an `fn` function that does not require a type. You can include it if you want, but you can elide it because Mojo already knows its type (`MyPair` in this case). ::: ### `fn` versus `def` in struct methods Struct methods can be declared with either the `def` or `fn` keywords. One important difference is that an `fn` function without the `raises` keyword can't raise an error. When you call a function that *can* raise an error from inside a method that *can't* raise an error, Mojo requires you to handle any errors, as described in [Errors, error handling, and context managers](/mojo/manual/errors). If you're writing code that you expect to use widely or distribute as a package, you may want to use `fn` functions for APIs that can't raise an error to limit the number of places users need to add error handling code. A struct's `__del__()` method, or destructor, **must** be a non-raising method, so it's always declared with `fn` (and without the `raises` keyword). ### Static methods A struct can also have *static methods*. A static method can be called without creating an instance of the struct. Unlike instance methods, a static method doesn't receive the implicit `self` argument, so it can't access any fields on the struct. To declare a static method, use the `@staticmethod` decorator and don't include a `self` argument: ```mojo struct Logger: fn __init__(out self): pass @staticmethod fn log_info(message: String): print("Info: ", message) ``` You can invoke a static method by calling it on the type (in this case, `Logger`). You can also call it on an instance of the type. Both forms are shown below: ```mojo Logger.log_info("Static method called.") var l = Logger() l.log_info("Static method called from instance.") ``` ```output Info: Static method called. Info: Static method called from instance. ``` ## Structs compared to classes If you're familiar with other object-oriented languages, then structs might sound a lot like classes, and there are some similarities, but also some important differences. Eventually, Mojo will also support classes to match the behavior of Python classes. So, let's compare Mojo structs to Python classes. They both support methods, fields, operator overloading, decorators for metaprogramming, and more, but their key differences are as follows: * Python classes are dynamic: they allow for dynamic dispatch, monkey-patching (or “swizzling”), and dynamically binding instance fields at runtime. 
* Mojo structs are static: they are bound at compile-time (you cannot add methods at runtime). Structs allow you to trade flexibility for performance while being safe and easy to use. * Mojo structs do not support inheritance ("sub-classing"), but a struct can implement [traits](/mojo/manual/traits). * Python classes support class attributes—values that are shared by all instances of the class, equivalent to class variables or static data members in other languages. * Mojo structs don't support static data members. Syntactically, the biggest difference compared to a Python class is that all fields in a struct must be explicitly declared with `var`. In Mojo, the structure and contents of a struct are set at compile time and can't be changed while the program is running. Unlike in Python, where you can add, remove, or change attributes of an object on the fly, Mojo doesn't allow that for structs. However, the static nature of structs helps Mojo run your code faster. The program knows exactly where to find the struct's information and how to use it without any extra steps or delays at runtime. Mojo's structs also work really well with features you might already know from Python, like operator overloading (which lets you change how math symbols like `+` and `-` work with your own data, using [special methods](#special-methods)). As mentioned above, all Mojo's standard types (`Int`, `String`, etc.) are made using structs, rather than being hardwired into the language itself. This gives you more flexibility and control when writing your code, and it means you can define your own types with all the same capabilities (there's no special treatment for the standard library types). ## Special methods Special methods (or "dunder methods") such as `__init__()` are pre-determined method names that you can define in a struct to perform a special task. Although it's possible to call special methods with their method names, the point is that you never should, because Mojo automatically invokes them in circumstances where they're needed (which is why they're also called "magic methods"). For example, Mojo calls the `__init__()` method when you create an instance of the struct; and when Mojo destroys the instance, it calls the `__del__()` method (if it exists). Even operator behaviors that appear built-in (`+`, `<`, `==`, `|`, and so on) are implemented as special methods that Mojo implicitly calls upon to perform operations or comparisons on the type that the operator is applied to. Mojo supports a long list of special methods; far too many to discuss here, but they generally match all of [Python's special methods](https://docs.python.org/3/reference/datamodel#special-method-names) and they usually accomplish one of two types of tasks: * Operator overloading: A lot of special methods are designed to overload operators such as `<` (less-than), `+` (add), and `|` (or) so they work appropriately with each type. For example, look at the methods listed for Mojo's [`Int` type](/mojo/stdlib/builtin/int/Int). One such method is `__lt__()`, which Mojo calls to perform a less-than comparison between two integers (for example, `num1 < num2`). * Lifecycle event handling: These special methods deal with the lifecycle and value ownership of an instance. For example, `__init__()` and `__del__()` demarcate the beginning and end of an instance lifetime, and other special methods define the behavior for other lifecycle events such as how to copy or move a value. 
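For example, here's a minimal sketch of operator overloading, extending the `MyPair` struct from earlier with a `__lt__()` method so that the `<` operator works on pairs (comparing by the `first` field is an arbitrary choice for illustration):

```mojo
struct MyPair:
    var first: Int
    var second: Int

    fn __init__(out self, first: Int, second: Int):
        self.first = first
        self.second = second

    # Mojo calls this implicitly to evaluate `a < b`.
    fn __lt__(self, rhs: MyPair) -> Bool:
        return self.first < rhs.first

fn main():
    var a = MyPair(1, 9)
    var b = MyPair(2, 0)
    print(a < b)  # True
```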
You can learn all about the lifecycle special methods in the [Value lifecycle](/mojo/manual/lifecycle/) section. However, most structs are simple aggregations of other types, so unless your type requires custom behaviors when an instance is created, copied, moved, or destroyed, you can synthesize the essential lifecycle methods you need (and save yourself some time) by adding the `@value` decorator. ### `@value` decorator When you add the [`@value` decorator](/mojo/manual/decorators/value) to a struct, Mojo will synthesize the essential lifecycle methods so your object provides full value semantics. Specifically, it generates the `__init__()`, `__copyinit__()`, and `__moveinit__()` methods, which allow you to construct, copy, and move your struct type in a manner that's value semantic and compatible with Mojo's ownership model. For example: ```mojo @value struct MyPet: var name: String var age: Int ``` Mojo will notice that you don't have a member-wise initializer, a move constructor, or a copy constructor, and it will synthesize these for you as if you had written: ```mojo struct MyPet: var name: String var age: Int fn __init__(out self, owned name: String, age: Int): self.name = name^ self.age = age fn __copyinit__(out self, existing: Self): self.name = existing.name self.age = existing.age fn __moveinit__(out self, owned existing: Self): self.name = existing.name^ self.age = existing.age ``` Without the copy and move constructors, the following code would not work because Mojo would not know how to copy an instance of `MyPet`: ```mojo var dog = MyPet("Charlie", 5) var poodle = dog print(poodle.name) ``` ```output Charlie ``` When you add the `@value` decorator, Mojo synthesizes each special method above only if it doesn't exist already. That is, you can still implement a custom version of each method. In addition to the `out` argument convention you already saw with `__init__()`, this code also introduces `owned`, which is another argument convention that ensures the argument has unique ownership of the value. For more detail, see the section about [value ownership](/mojo/manual/values/ownership). --- ## Structured output import TutorialStack from '@site/src/components/TutorialStack'; MAX supports the generation of structured output using [XGrammar](https://github.com/mlc-ai/xgrammar) as a backend. Structured output, also sometimes referred to as constrained decoding, allows users to enforce specific output formats, ensuring structured and predictable responses from a model. :::note Structured output is compatible with GPU deployments and MAX models only. Support for PyTorch models and CPU deployments is in progress. ::: ## When to use structured output If you want to structure a model's output when it responds to a user, then you should use a structured output `response_format`. If you are connecting a model to tools, functions, data, or other systems, then you should use [function calling](/max/serve/function-calling) instead of structured outputs. ## How structured output works To use structured output, use the `--enable-structured-output` flag when serving your model with the `max` CLI. ```bash max serve \ --model-path="modularai/Llama-3.1-8B-Instruct-GGUF" \ --enable-structured-output ``` Then, when making inference requests, you must specify a `response_format` JSON schema. Both the `/chat/completions` and `/completions` API endpoints are compatible with structured output. 
### JSON schema To specify a structured output within your inference request, use the following format: :::note You can increase the accuracy of structured output responses by mentioning JSON output specifications in your system prompt. ::: ``` curl -N http://0.0.0.0:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "modularai/Llama-3.1-8B-Instruct-GGUF", "messages": [ {"role": "system", "content": "You are a helpful math tutor. Guide the user through the solution step by step. Provide your guidance in JSON format."}, {"role": "user", "content": "How can I solve 8x + 7 = -23"} ], "response_format": { "type": "json_schema", "json_schema": { "name": "math_response", "schema": { "type": "object", "properties": { "steps": { "type": "array", "items": { "type": "object", "properties": { "explanation": {"type": "string"}, "output": {"type": "string"} }, "required": ["explanation", "output"], "additionalProperties": false } }, "final_answer": {"type": "string"} }, "required": ["steps", "final_answer"], "additionalProperties": false } } }' ``` ### Schema validation You can also define your structured output using the Pydantic [`BaseModel`](https://docs.pydantic.dev/latest/api/base_model/) to validate your JSON schema in Python. Here's an example: ```python from pydantic import BaseModel from openai import OpenAI client = OpenAI() class CalendarEvent(BaseModel): name: str date: str participants: list[str] completion = client.beta.chat.completions.parse( model="modularai/Llama-3.1-8B-Instruct-GGUF", messages=[ {"role": "system", "content": "Extract the event information."}, {"role": "user", "content": "Alice and Bob are going to a movie on Friday."}, ], response_format=CalendarEvent, ) event = completion.choices[0].message.parsed ``` ### Supported models All text generation models support structured output with MAX. As new models are added, they will also be compatible with structured output. This functionality is implemented at the pipeline level, ensuring consistency across different models. However, structured output currently doesn't support PyTorch models or CPU deployments—only [MAX models](/max/model-formats#max-graph) deployed on GPUs. ## Next steps For more examples, you can explore structured output [recipes](https://builds.modular.com/?category=recipes&tag=structured-output). After defining your output structure, you can explore deploying your workflow on GPUs. export const tutorials = [ 'max-serve-local-to-cloud', 'deploy-max-serve-on-kubernetes', ]; --- ## stx `stx(gpr: Int)` --- ## sty `sty(gpr: Int)` --- ## stz `stz(gpr: Int)` --- ## stzi `stzi(gpr: Int)` --- ## sub `sub(x: SIMD[dtype, size], y: SIMD[dtype, size]) -> SIMD[dtype, size]` --- ## sublayout `sublayout(layout: Layout, *modes: Int) -> Layout` Creates a sublayout by selecting specific dimensions from a layout. This function extracts a subset of dimensions from a layout to create a new layout with lower rank. For example, from a 3D layout, you could extract a 2D layout containing only the first and third dimensions. Example: From a layout with shape (3,4,5), sublayout(layout, 0, 2) would create a layout with shape (3,5). **Args:** * ​layout (`Layout`): The source layout to extract dimensions from. * ​\*modes (`Int`): The indices of dimensions to include in the sublayout. **Returns:** A new layout containing only the specified dimensions. --- ## SubMatmulConfig `struct SubMatmulConfig` Static configuration of sub-matrices in parallel matmul.
## Fields

* ​offset (`IndexList[3]`):
* ​shape (`IndexList[3]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `is_valid`

`is_valid(self) -> Bool`

---

## subprocess

Implements the subprocess package.

## Modules

* [​`subprocess`](/mojo/stdlib/subprocess/subprocess/): Implements the subprocess package.

---

## subprocess

Implements the subprocess package.

## Functions

* [​`run`](/mojo/stdlib/subprocess/subprocess/run): Runs the specified command and returns the output as a string.

---

## sum

`sum(t: IntTuple[origin]) -> Int`

Calculate the sum of all values in an `IntTuple`.

This function recursively computes the sum of all integer values in a potentially nested `IntTuple` structure.

**Args:**

* ​t (`IntTuple[origin]`): The `IntTuple` to sum.

**Returns:**

The sum of all integer values, or `UNKNOWN_VALUE` if any value in the tuple is `UNKNOWN_VALUE`.

---

## sum

`sum[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], out: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])`

Computes sum reduction along specified axis.

Reduces the input tensor by summing elements along the specified axis and stores the result in the output tensor.

Example:

```mojo
from layout import LayoutTensor, Layout
from layout.math import sum

data = InlineArray[Int32, 6](0, 1, 2, 3, 4, 5)
tensor = LayoutTensor[DType.int32, Layout.row_major(2, 3)](data)
print(tensor)
print("-----")
print(sum[0](tensor))
```

Output:

```plaintext
0 1 2
3 4 5
-----
3 5 7
```

**Constraints:**

All tensors must have statically known shapes. `out.rank` must equal `inp.rank - 1`. Non-reduction dimensions must match between inp and out. Currently only supports rank-2 inputs.

**Parameters:**

* ​axis (`Int`): The axis to sum along.

**Args:**

* ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to sum.
* ​out (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor to store sum results.

`sum[axis: Int](inp: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[dtype, _reduce_res_row_major_shape(axis, layout), MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type]`

Computes sum reduction along specified axis, returning a new tensor.

Reduces the input tensor by summing elements along the specified axis and returns a new tensor with the results.

**Constraints:**

All tensors must have statically known shapes. Result will have rank equal to `inp.rank` - 1. Non-reduction dimensions in the result match the input. Currently only supports rank-2 inputs.

**Parameters:**

* ​axis (`Int`): The axis to sum along.
**Args:** * ​inp (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor to sum. **Returns:** A new tensor containing the sum values along the specified axis. --- ## sum `sum(src: NDBuffer[type, 1, origin]) -> SIMD[type, 1]` Computes the sum of buffer elements. **Args:** * ​src (`NDBuffer[type, 1, origin]`): The buffer. **Returns:** The sum of the buffer elements. `sum[reduce_axis: Int](src: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive], dst: NDBuffer[type, rank, origin, shape])` Computes the sum across reduce\_axis of an NDBuffer. **Parameters:** * ​reduce\_axis (`Int`): The axis to reduce across. **Args:** * ​src (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The input buffer. * ​dst (`NDBuffer[type, rank, origin, shape]`): The output buffer. `sum[: origin.set, : origin.set, //, type: DType, input_fn: fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0], output_fn: fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None, /, single_thread_blocking_override: Bool = False, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input_shape: IndexList[size], reduce_dim: Int, context: DeviceContextPtr = DeviceContextPtr())` Computes the sum across the input and output shape. This performs the sum computation on the domain specified by `input_shape`, loading the inputs using the `input_fn`. The results are stored using the `output_fn`. **Parameters:** * ​type (`DType`): The type of the input and output. * ​input\_fn (`fn[Int, Int](IndexList[$1]) capturing -> SIMD[type, $0]`): The function to load the input. * ​output\_fn (`fn[Int, Int](IndexList[$1], SIMD[type, $0]) capturing -> None`): The function to store the output. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. * ​target (`StringSlice[StaticConstantOrigin]`): The target to run on. **Args:** * ​input\_shape (`IndexList[size]`): The input shape. * ​reduce\_dim (`Int`): The axis to perform the sum on. * ​context (`DeviceContextPtr`): The pointer to DeviceContext. --- ## sum `sum[type: DType, width: Int, //, *, block_size: Int, broadcast: Bool = True](val: SIMD[type, width]) -> SIMD[type, width]` Computes the sum of values across all threads in a block. Performs a parallel reduction using warp-level operations and shared memory to find the global sum across all threads in the block. **Parameters:** * ​type (`DType`): The data type of the SIMD elements. * ​width (`Int`): The number of elements in each SIMD vector. * ​block\_size (`Int`): The total number of threads in the block. * ​broadcast (`Bool`): If True, the final sum is broadcast to all threads in the block. If False, only the first thread will have the complete sum. **Args:** * ​val (`SIMD[type, width]`): The SIMD value to reduce. Each thread contributes its value to the sum. **Returns:** If broadcast is True, each thread in the block will receive the final sum. Otherwise, only the first thread will have the complete sum. --- ## sum `sum[val_type: DType, simd_width: Int, //](val: SIMD[val_type, simd_width]) -> SIMD[val_type, simd_width]` Computes the sum of values across all lanes in a warp. 
This is a convenience wrapper around lane\_group\_sum\_and\_broadcast that operates on the entire warp. It performs a parallel reduction using warp shuffle operations to find the global sum across all lanes in the warp.

**Parameters:**

* ​val\_type (`DType`): The data type of the SIMD elements (e.g. float32, int32).
* ​simd\_width (`Int`): The number of elements in the SIMD vector.

**Args:**

* ​val (`SIMD[val_type, simd_width]`): The SIMD value to reduce. Each lane contributes its value to the sum.

**Returns:**

A SIMD value where all lanes contain the sum found across the entire warp. The sum is broadcast to all lanes.

`sum[intermediate_type: DType, *, reduction_method: ReductionMethod, output_type: DType](x: SIMD[dtype, size]) -> SIMD[output_type, 1]`

Performs a warp-level reduction to compute the sum of values across threads.

This function provides two reduction methods:

1. Warp shuffle: Uses warp shuffle operations to efficiently sum values across threads.
2. Tensor core: Leverages tensor cores for high-performance reductions, with type casting.

The tensor core method will cast the input to the specified intermediate type before reduction to ensure compatibility with tensor core operations. The warp shuffle method requires the output type to match the input type.

**Constraints:**

* For warp shuffle reduction, output\_type must match the input value type.
* For tensor core reduction, input will be cast to intermediate\_type.

**Parameters:**

* ​intermediate\_type (`DType`): The data type to cast to when using tensor core reduction.
* ​reduction\_method (`ReductionMethod`): `WARP` for warp shuffle or `TENSOR_CORE` for tensor core reduction.
* ​output\_type (`DType`): The desired output data type for the reduced value.

**Args:**

* ​x (`SIMD[dtype, size]`): The SIMD value to reduce across the warp.

**Returns:**

A scalar containing the sum of the input values across all threads in the warp, cast to the specified output type.

---

## swap

Implements the built-in `swap` function.

These are Mojo built-ins, so you don't need to import them.

## Functions

* [​`swap`](/mojo/stdlib/builtin/swap/swap): Swaps the two given arguments.

---

## swap

`swap[T: Movable](mut lhs: T, mut rhs: T)`

Swaps the two given arguments.

**Parameters:**

* ​T (`Movable`): Constrained to Movable types.

**Args:**

* ​lhs (`T`): Argument value swapped with rhs.
* ​rhs (`T`): Argument value swapped with lhs.

---

## swilu

`swilu[type: DType, width: Int](x: SIMD[type, width], y: SIMD[type, width]) -> SIMD[type, width]`

---

## swishGLU

`swishGLU[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](a: NDBuffer[a_type, 2, MutableAnyOrigin, a_shape], b0: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], b1: NDBuffer[b_type, 2, MutableAnyOrigin, b_shape], c: NDBuffer[c_type, 2, MutableAnyOrigin, c_shape], ctx: DeviceContextPtr)`

Reference: "GLU Variants Improve Transformer" by Noam Shazeer. The implementation follows CUTLASS, using one kernel invocation and writing to the destination once.

---

## swizzle

Defines swizzle layouts for optimizing memory access patterns. This module is designed for use in shared memory, especially in GPU kernels, to reduce bank conflicts. It provides tools to create and apply swizzle transformations to memory indices.

Swizzling rearranges memory access order to distribute accesses across different memory banks. This mitigates bank contention and improves memory access efficiency.
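To make the transformation concrete, here is a minimal sketch that constructs a swizzle and applies it to a range of linear offsets (this assumes `Swizzle` is importable from the `layout.swizzle` module; its constructor and call operator are documented under the `Swizzle` struct below):

```mojo
from layout.swizzle import Swizzle

fn main():
    # Swizzle(bits=2, base=0, shift=3): XORs two bits taken from
    # positions 3-4 of each offset into positions 0-1, so offsets
    # that would otherwise map to the same bank get spread out.
    var sw = Swizzle(2, 0, 3)
    for i in range(16):
        print(i, "->", sw(i))
```

Offsets below 8 pass through unchanged; from 8 upward the low bits start flipping, which is the redistribution that mitigates bank conflicts.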
Module components: * `Swizzle` struct: Represents a swizzle transformation with configurable bits, base, and shift parameters. * Helper functions: `make_ldmatrix_swizzle`, `make_swizzle` create predefined swizzle patterns. These are optimized for scenarios like `ldmatrix` instructions and general 2D memory access. * `ComposedLayout` struct: Combines a base layout with a swizzle layout for complex memory access optimizations. Primary use case: GPU kernel development where shared memory bank conflicts can degrade performance. Applying swizzle layouts optimizes memory access patterns for higher throughput. ## Structs * [​`ComposedLayout`](./ComposedLayout): Layout composed of two layouts applied sequentially. * [​`Swizzle`](./Swizzle): Swizzle functor for memory access pattern optimization. ## Functions * [​`eval_composed`](./eval_composed): Evaluate a composed layout with swizzle. * [​`make_ldmatrix_swizzle`](./make_ldmatrix_swizzle): Make swizzle to avoid bank conflict for ldmatrix ops. * [​`make_swizzle`](./make_swizzle): Create a 2D swizzle to avoid bank conflicts. * [​`shiftl`](./shiftl): Shift left or right based on sign of shift amount. * [​`shiftr`](./shiftr): Shift right or left based on sign of shift amount. --- ## Swizzle `@register_passable(trivial)` `struct Swizzle` Swizzle functor for memory access pattern optimization. Implements a swizzling pattern to reduce bank conflicts in shared memory accesses. It XORs specific bits of memory indices based on configurable parameters. Swizzle operation: Given index `i`, and Swizzle\[bits, base, shift]: 1. Extract `bits` number of bits from `i` starting from position `base + max(0, shift)`. Let's call this `YYY`. 2. Extract `bits` number of bits from `i` starting from position `base - min(0, shift)`. Let's call this `ZZZ`. 3. Result is `i ^ (YYY shifted by 'shift' positions)`. Example (Swizzle\[2, 0, 3]): Input index bits: `xxxxxxxxxxxxxxxxYYxxxxxxxxxZZxxxx` Output index bits: `xxxxxxxxxxxxxxxxYYxxxxxxxxxAAxxxx` where `AA = ZZ ^ YY`. Attributes: bits (Int): Number of bits in the mask (YYY). base (Int): Number of least significant bits to keep constant. shift (Int): Shift distance for the mask (positive: right, negative: left). yyy\_mask (Int): Mask for the bits to be shifted (YYY). zzz\_mask (Int): Mask for the target bits (ZZZ). ## Fields * ​bits (`Int`): Number of bits in the mask. * ​base (`Int`): Number of least significant bits to keep constant. * ​shift (`Int`): Distance to shift the mask (pos right, neg left). * ​yyy\_mask (`Int`): Mask for the bits to be shifted. * ​zzz\_mask (`Int`): Mask for the target bits. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `LayoutTrait`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `has_shape` `alias has_shape = False` Indicates if layout has shape. Swizzle always False. ## Methods ### `__init__` `__init__(bits: Int, base: Int, shift: Int) -> Self` Initialize a Swizzle object. Configures the swizzle operation based on bits, base, and shift parameters. **Args:** * ​bits (`Int`): Number of bits in the mask. * ​base (`Int`): Least significant bits to keep constant. * ​shift (`Int`): Distance to shift the mask. ### `__call__` `__call__(self, index: IntTuple[origin]) -> Int` Apply swizzle to an IntTuple index. Unwraps the IntTuple and applies the swizzle to the integer value. **Args:** * ​index (`IntTuple[origin]`): The IntTuple index to swizzle. **Returns:** The swizzled index value. 
`__call__(self, offset: Int) -> Int` Apply swizzle to an integer offset. Performs the swizzle operation on an integer offset to rearrange memory access patterns. **Args:** * ​offset (`Int`): The integer offset to swizzle. **Returns:** The swizzled offset value. `__call__(self, offset: SIMD[dtype, 1]) -> SIMD[dtype, 1]` Apply swizzle to a scalar offset. Scalar version of the swizzle operation. Applies swizzle to a scalar offset. **Args:** * ​offset (`SIMD[dtype, 1]`): The scalar offset to swizzle. **Returns:** The swizzled scalar value. ### `size` `size(self) -> Int` Get the size of the swizzle pattern. Calculates the size of the memory region affected by the swizzle pattern. **Returns:** The size of the swizzle pattern. ### `cosize` `cosize(self) -> Int` Get the cosize of the swizzle pattern. Cosize is the same as size for swizzle layouts, representing the output size. **Returns:** The cosize of the swizzle pattern (same as size). ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write the swizzle parameters to a writer. Outputs the swizzle parameters (bits, base, shift) in a tuple format. **Parameters:** * ​W (`Writer`): The writer type that implements the Writer trait. **Args:** * ​writer (`W`): The writer to write to. ### `__str__` `__str__(self) -> String` Convert the swizzle to a string representation. **Returns:** String representation of the swizzle parameters. --- ## sync This module provides GPU synchronization primitives and barriers. The module includes: * Block-level synchronization barriers (barrier()) * Warp-level synchronization (syncwarp()) * Memory barriers (mbarrier) for NVIDIA GPUs * Instruction scheduling controls for AMD GPUs * Asynchronous copy and bulk transfer synchronization The synchronization primitives help coordinate execution between threads within thread blocks and warps, and manage memory consistency across different memory spaces. ## Structs * [​`AMDScheduleBarrierMask`](/mojo/stdlib/gpu/sync/AMDScheduleBarrierMask): Represents different instruction scheduling masks for AMDGPU scheduling instructions. ## Functions * [​`async_copy_arrive`](/mojo/stdlib/gpu/sync/async_copy_arrive): Makes a memory barrier track all prior async copy operations from this thread. * [​`barrier`](/mojo/stdlib/gpu/sync/barrier): Performs a synchronization barrier at the block level. * [​`cp_async_bulk_commit_group`](/mojo/stdlib/gpu/sync/cp_async_bulk_commit_group): Commits all prior initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group. * [​`cp_async_bulk_wait_group`](/mojo/stdlib/gpu/sync/cp_async_bulk_wait_group): Waits for completion of asynchronous bulk memory transfer groups. * [​`mbarrier_arrive`](/mojo/stdlib/gpu/sync/mbarrier_arrive): Signal thread arrival at a shared memory barrier. * [​`mbarrier_arrive_expect_tx_shared`](/mojo/stdlib/gpu/sync/mbarrier_arrive_expect_tx_shared): Configure a shared memory barrier to expect additional async transactions. * [​`mbarrier_init`](/mojo/stdlib/gpu/sync/mbarrier_init): Initialize a shared memory barrier for synchronizing multiple threads. * [​`mbarrier_test_wait`](/mojo/stdlib/gpu/sync/mbarrier_test_wait): Test if all threads have arrived at the memory barrier. * [​`mbarrier_try_wait_parity_shared`](/mojo/stdlib/gpu/sync/mbarrier_try_wait_parity_shared): Wait for completion of a barrier phase with timeout. * [​`named_barrier`](/mojo/stdlib/gpu/sync/named_barrier): Performs a named synchronization barrier at the block level. 
* [​`schedule_barrier`](/mojo/stdlib/gpu/sync/schedule_barrier): Controls instruction scheduling across a barrier point in AMD GPU code. * [​`schedule_group_barrier`](/mojo/stdlib/gpu/sync/schedule_group_barrier): Controls instruction scheduling across a barrier point in AMD GPU code by creating schedule groups. * [​`syncwarp`](/mojo/stdlib/gpu/sync/syncwarp): Synchronizes threads within a warp using a barrier. --- ## sync_parallelize `sync_parallelize[origins: origin.set, //, func: fn(Int) capturing -> None](num_work_items: Int)` Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) capturing -> None`): The function to invoke. **Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. `sync_parallelize[origins: origin.set, //, func: fn(Int) raises capturing -> None](num_work_items: Int)` Executes func(0) ... func(num\_work\_items-1) as parallel sub-tasks, and returns when all are complete. TODO: Currently exceptions raised by func will cause a trap rather than be propagated back to the caller. **Parameters:** * ​origins (`origin.set`): The capture origins. * ​func (`fn(Int) raises capturing -> None`): The function to invoke. **Args:** * ​num\_work\_items (`Int`): Number of parallel tasks. --- ## syncwarp `syncwarp(mask: Int = -1)` Synchronizes threads within a warp using a barrier. This function creates a synchronization point where threads in a warp must wait until all threads specified by the mask reach this point. On NVIDIA GPUs, it uses warp-level synchronization primitives. On AMD GPUs, this is a no-op since threads execute in lock-step. Note: * On NVIDIA GPUs, this maps to the nvvm.bar.warp.sync intrinsic. * On AMD GPUs, this is a no-op since threads execute in lock-step. * Threads not participating in the sync must still execute the instruction. **Args:** * ​mask (`Int`): An integer bitmask specifying which lanes (threads) in the warp should be synchronized. Each bit corresponds to a lane, with bit i controlling lane i. A value of 1 means the lane participates in the sync, 0 means it does not. Default value of -1 (all bits set) synchronizes all lanes. --- ## sys Implements the sys package. ## Modules * [​`arg`](/mojo/stdlib/sys/arg/): Implements functions and variables for interacting with execution and system environment. * [​`compile`](/mojo/stdlib/sys/compile/): Implements functions that return compile-time information. * [​`debug`](/mojo/stdlib/sys/debug/): This module includes the debug hook functions. * [​`ffi`](/mojo/stdlib/sys/ffi/): Implements a foreign functions interface (FFI). * [​`info`](/mojo/stdlib/sys/info/): Implements methods for querying the host target info. * [​`intrinsics`](/mojo/stdlib/sys/intrinsics/): Defines intrinsics. * [​`param_env`](/mojo/stdlib/sys/param_env/): Implements functions for retrieving compile-time defines. * [​`terminate`](/mojo/stdlib/sys/terminate/): This module includes the exit functions. --- ## tan `tan[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Computes the `tan` of the inputs. **Constraints:** The input must be a floating-point type. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The input argument. **Returns:** The `tan` of the input. 
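A minimal elementwise usage sketch (importing `tan` from the standard library `math` module):

```mojo
from math import tan

fn main():
    # Elementwise tangent over a 4-lane float32 vector.
    var x = SIMD[DType.float32, 4](0.0, 0.5, 1.0, 1.5)
    print(tan(x))
```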
--- ## tanh `tanh[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]` Performs elementwise evaluation of the tanh function. **Parameters:** * ​dtype (`DType`): The `dtype` of the input and output SIMD vector. * ​width (`Int`): The width of the input and output SIMD vector. **Args:** * ​x (`SIMD[dtype, width]`): The vector to perform the elementwise tanh on. **Returns:** The result of the elementwise tanh operation. --- ## Task `struct Task[type: AnyType, origins: origin.set]` Represents an asynchronous task that will produce a value of the specified type. A Task encapsulates a coroutine that is executing asynchronously and will eventually produce a result. Tasks can be awaited in async functions or waited on in synchronous code. ## Parameters * ​type (`AnyType`): The type of value that this task will produce when completed. * ​origins (`origin.set`): The set of origins for the coroutine wrapped by this task. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, owned handle: Coroutine[type, origins])` Initialize a task with a coroutine. Takes ownership of the provided coroutine and sets up the task to receive its result when completed. **Args:** * ​handle (`Coroutine[type, origins]`): The coroutine to execute as a task. Ownership is transferred. ### `__del__` `__del__(owned self)` Destroy the memory associated with a task. This must be manually called when a task goes out of scope. ### `__await__` `__await__(self) -> ref [*[0,0]._result] type` Suspend the current async function until the task completes and its result becomes available. This function must be force inlined into the calling async function. This method enables the use of the 'await' keyword with Task objects in async functions. **Returns:** A reference to the result value produced by the task. ### `get` `get(self) -> ref [*[0,0]._result] type` Get the task's result value. Calling this on an incomplete task is undefined behavior. **Returns:** A reference to the result value produced by the task. ### `wait` `wait(self) -> ref [*[0,0]._result] type` Block the current thread until the future value becomes available. This method is used in synchronous code to wait for an asynchronous task to complete. Unlike `__await__`, this method does not suspend the current coroutine but instead blocks the entire thread. **Returns:** A reference to the result value produced by the task. --- ## TaskGroup `struct TaskGroup` A group of tasks that can be executed concurrently. TaskGroup manages a collection of coroutines that can be executed in parallel. It provides mechanisms to create, track, and wait for the completion of tasks. ## Fields * ​counter (`Atomic[index]`): Atomic counter tracking the number of active tasks in the group. * ​chain (`_Chain`): Chain used for asynchronous completion notification. * ​tasks (`List[_TaskGroupBox]`): Collection of tasks managed by this TaskGroup. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize a new TaskGroup with an empty task list and initialized chain. ### `__del__` `__del__(owned self)` Clean up resources associated with the TaskGroup. ### `__await__` `__await__(mut self)` Make TaskGroup awaitable in async contexts. This allows using 'await task\_group' syntax in async functions. ### `create_task` `create_task(mut self, owned task: Coroutine[None, origins])` Add a new task to the TaskGroup for execution. 
**Args:** * ​task (`Coroutine[None, origins]`): The coroutine to be executed as a task. ### `await_body_impl` `static await_body_impl(hdl: !co.routine, mut task_group: Self)` Implementation of the await functionality for TaskGroup. **Args:** * ​hdl (`!co.routine`): The coroutine handle to be awaited. * ​task\_group (`Self`): The TaskGroup to be awaited. ### `wait` `wait[origins: origin.set = {}](mut self)` Wait for all tasks in the `TaskGroup` to complete. This is a blocking call that returns only when all tasks have finished. **Parameters:** * ​origins (`origin.set`): The origin set for the wait operation. --- ## TaskGroupContext `@register_passable(trivial)` `struct TaskGroupContext` Context structure for task group operations. This structure holds a callback function and a pointer to a TaskGroup, allowing asynchronous operations to interact with their parent TaskGroup when they complete. ## Fields * ​callback (`fn(mut TaskGroup) -> None`): Callback function to be invoked on the TaskGroup when an operation completes. * ​task\_group (`UnsafePointer[TaskGroup]`): Pointer to the TaskGroup that owns or is associated with this context. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `tg_callback_fn_type` `alias tg_callback_fn_type = fn(mut TaskGroup) -> None` Type definition for callback functions that operate on TaskGroups. --- ## tc_reduce `tc_reduce[in_type: DType, simd_width: Int, //, out_type: DType](val: SIMD[in_type, simd_width]) -> SIMD[out_type, 1]` Performs tensor core based reduction on a SIMD vector. Note: Dispatches to either scalar or vector reduction implementation based on SIMD width. Supports various input/output type combinations using tensor core operations. **Parameters:** * ​in\_type (`DType`): The input data type of the SIMD vector elements. * ​simd\_width (`Int`): The width of the SIMD vector. * ​out\_type (`DType`): The output data type for the reduced result. **Args:** * ​val (`SIMD[in_type, simd_width]`): Input SIMD vector to reduce. **Returns:** Scalar containing the reduced result. --- ## tc_reduce_gevm_4x `tc_reduce_gevm_4x[out_type: DType, in_type: DType, simd_width: Int](val1: SIMD[in_type, simd_width]) -> SIMD[out_type, simd_width]` Performs a 4x GEVM reduction using tensor cores. Note: Currently only supports bfloat16 input to float32 output conversion. Uses tensor core matrix multiply-accumulate (MMA) operations for reduction. **Parameters:** * ​out\_type (`DType`): The output data type for the reduction result (must be float32). * ​in\_type (`DType`): The input data type of the vector to reduce (must be bfloat16). * ​simd\_width (`Int`): The width of the SIMD vector. **Args:** * ​val1 (`SIMD[in_type, simd_width]`): Input SIMD vector to reduce. **Returns:** SIMD vector containing the reduced result. --- ## tc_reduce_gevm_8x `tc_reduce_gevm_8x[out_type: DType, in_type: DType, simd_width: Int](val1: SIMD[in_type, simd_width], val2: SIMD[in_type, simd_width]) -> SIMD[out_type, simd_width]` Performs an 8x GEVM reduction using tensor cores. Note: Currently only supports bfloat16 input to float32 output conversion. Uses tensor core matrix multiply-accumulate (MMA) operations for reduction. **Parameters:** * ​out\_type (`DType`): The output data type for the reduction result (must be float32). * ​in\_type (`DType`): The input data type of the vectors to reduce (must be bfloat16). * ​simd\_width (`Int`): The width of the SIMD vectors. 
**Args:** * ​val1 (`SIMD[in_type, simd_width]`): First input SIMD vector to reduce. * ​val2 (`SIMD[in_type, simd_width]`): Second input SIMD vector to reduce. **Returns:** SIMD vector containing the reduced result. --- ## tcgen05 This module includes utilities for working with the tensorcore 5th generation (tcgen05) instructions. ## Aliases ### `check_blackwell_constraint` `alias check_blackwell_constraint = constrained[::Bool,::StringSlice[::Bool[_has_blackwell_tcgen05(), __init__[__mlir_type.!kgen.string]("The tcgen05 instructions are only applicable on nVidia Blackwell (sm_100a, sm_101a) hardware."), ?]` ## Structs * [​`TensorMemory`](/mojo/stdlib/gpu/tcgen05/TensorMemory): A wrapper around tensor memory allocated for tcgen05 instructions. ## Functions * [​`tcgen05_alloc`](/mojo/stdlib/gpu/tcgen05/tcgen05_alloc): Allocates tensor memory for use with tcgen05 instructions. * [​`tcgen05_cp`](/mojo/stdlib/gpu/tcgen05/tcgen05_cp): Copies data from shared memory described by the matrix descriptor `s_desc` to tensor memory `tmem_addr`. * [​`tcgen05_dealloc`](/mojo/stdlib/gpu/tcgen05/tcgen05_dealloc): Deallocates tensor memory allocated by tcgen05\_alloc(). * [​`tcgen05_fence_after`](/mojo/stdlib/gpu/tcgen05/tcgen05_fence_after): Orders all the subsequent asynchronous `tcgen05` operations. * [​`tcgen05_fence_before`](/mojo/stdlib/gpu/tcgen05/tcgen05_fence_before): Orders all the prior asynchronous `tcgen05` operations. * [​`tcgen05_ld`](/mojo/stdlib/gpu/tcgen05/tcgen05_ld): Loads data from tensor memory into registers. * [​`tcgen05_load_wait`](/mojo/stdlib/gpu/tcgen05/tcgen05_load_wait): Waits for tensor memory loads to complete. * [​`tcgen05_release_allocation_lock`](/mojo/stdlib/gpu/tcgen05/tcgen05_release_allocation_lock): Releases the allocation lock for the current CTA group. * [​`tcgen05_st`](/mojo/stdlib/gpu/tcgen05/tcgen05_st): Stores data from registers into tensor memory. * [​`tcgen05_store_wait`](/mojo/stdlib/gpu/tcgen05/tcgen05_store_wait): Waits for tensor memory stores to complete. --- ## tcgen05_alloc `tcgen05_alloc[cta_group: SIMD[int32, 1]](ptr_tmem_addr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16], num_cols: SIMD[uint32, 1])` Allocates tensor memory for use with tcgen05 instructions. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. **Args:** * ​ptr\_tmem\_addr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16]`): Shared memory pointer to hold tensor memory address. * ​num\_cols (`SIMD[uint32, 1]`): The number of columns to allocate. --- ## tcgen05_cp `tcgen05_cp[*, cta_group: SIMD[int32, 1], datapaths: Int, bits: Int, src_fmt: String = __init__[__mlir_type.!kgen.string](""), dst_fmt: String = __init__[__mlir_type.!kgen.string](""), multicast: String = __init__[__mlir_type.!kgen.string]("")](tmem_addr: SIMD[uint32, 1], s_desc: MMASmemDescriptor)` Copies data from shared memory described by the matrix descriptor `s_desc` to tensor memory `tmem_addr`. Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+). **Parameters:** * ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID. * ​datapaths (`Int`): The first dimension of the shape. * ​bits (`Int`): The second dimension of the shape. * ​src\_fmt (`String`): Source format string. * ​dst\_fmt (`String`): Destination format string. * ​multicast (`String`): Multicast string. 
**Args:**

* ​tmem\_addr (`SIMD[uint32, 1]`): Address of the tensor memory.
* ​s\_desc (`MMASmemDescriptor`): Matrix descriptor for the copy operation.

---

## tcgen05_dealloc

`tcgen05_dealloc[cta_group: SIMD[int32, 1]](tmem_addr: SIMD[uint32, 1], num_cols: SIMD[uint32, 1])`

Deallocates tensor memory allocated by tcgen05\_alloc().

This function deallocates tensor memory that was previously allocated using tcgen05\_alloc(). The deallocation must be performed by the same CTA group that performed the allocation.

**Parameters:**

* ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID.

**Args:**

* ​tmem\_addr (`SIMD[uint32, 1]`): Address of the tensor memory to deallocate.
* ​num\_cols (`SIMD[uint32, 1]`): Number of columns in the tensor memory.

---

## tcgen05_fence_after

`tcgen05_fence_after()`

Orders all the subsequent asynchronous `tcgen05` operations.

Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+).

---

## tcgen05_fence_before

`tcgen05_fence_before()`

Orders all the prior asynchronous `tcgen05` operations.

Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+).

---

## tcgen05_ld

`tcgen05_ld[*, datapaths: Int, bits: Int, repeat: Int, type: DType, pack: Bool, width: Int = (datapaths * bits * repeat) // 1024](tmem_addr: SIMD[uint32, 1]) -> SIMD[type, width]`

Loads data from tensor memory into registers.

**Parameters:**

* ​datapaths (`Int`): The first dimension of the shape.
* ​bits (`Int`): The second dimension of the shape.
* ​repeat (`Int`): The repeat factor.
* ​type (`DType`): The data type to load.
* ​pack (`Bool`): Whether to pack two 16-bit chunks of adjacent columns into a single 32-bit register.
* ​width (`Int`): The number of elements in the result vector.

**Args:**

* ​tmem\_addr (`SIMD[uint32, 1]`): The address of the tensor memory to load from.

**Returns:**

The SIMD register containing the loaded data.

---

## tcgen05_load_wait

`tcgen05_load_wait()`

Waits for tensor memory loads to complete.

Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+).

---

## tcgen05_release_allocation_lock

`tcgen05_release_allocation_lock[cta_group: SIMD[int32, 1]]()`

Releases the allocation lock for the current CTA group.

Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+).

**Parameters:**

* ​cta\_group (`SIMD[int32, 1]`): The cooperative thread array (CTA) group ID.

---

## tcgen05_st

`tcgen05_st[type: DType, width: Int, //, *, datapaths: Int, bits: Int, repeat: Int, pack: Bool](tmem_addr: SIMD[uint32, 1], data: SIMD[type, width])`

Stores data from registers into tensor memory.

**Parameters:**

* ​type (`DType`): The data type to store.
* ​width (`Int`): The number of elements in the data vector.
* ​datapaths (`Int`): The first dimension of the shape.
* ​bits (`Int`): The second dimension of the shape.
* ​repeat (`Int`): The repeat factor.
* ​pack (`Bool`): Whether to pack two 16-bit chunks of adjacent columns into a single 32-bit register.

**Args:**

* ​tmem\_addr (`SIMD[uint32, 1]`): The address of the tensor memory to store to.
* ​data (`SIMD[type, width]`): The data to store into the tensor memory.

---

## tcgen05_store_wait

`tcgen05_store_wait()`

Waits for tensor memory stores to complete.

Note: This function is only available on NVIDIA Blackwell GPUs (SM 100+).
--- ## tempfile Implements the tempfile package. ## Modules * [​`tempfile`](/mojo/stdlib/tempfile/tempfile/): Implements tempfile methods. --- ## tempfile Implements tempfile methods. You can import a method from the `tempfile` package. For example: ```mojo from tempfile import gettempdir ``` ## Aliases ### `TMP_MAX` `alias TMP_MAX = 10000` ## Structs * [​`NamedTemporaryFile`](/mojo/stdlib/tempfile/tempfile/NamedTemporaryFile): A handle to a temporary file. * [​`TemporaryDirectory`](/mojo/stdlib/tempfile/tempfile/TemporaryDirectory): A temporary directory. ## Functions * [​`gettempdir`](/mojo/stdlib/tempfile/tempfile/gettempdir): Return the default directory to use for temporary files. * [​`mkdtemp`](/mojo/stdlib/tempfile/tempfile/mkdtemp): Create a temporary directory. Caller is responsible for deleting the directory when done with it. --- ## TemporaryDirectory `struct TemporaryDirectory` A temporary directory. ## Fields * ​name (`String`): The name of the temporary directory. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, suffix: String = __init__[__mlir_type.!kgen.string](""), prefix: String = __init__[__mlir_type.!kgen.string]("tmp"), dir: Optional[String] = Optional(None), ignore_cleanup_errors: Bool = False)` Create a temporary directory. Can be used as a context manager. When used as a context manager, the directory is removed when the context manager exits. **Args:** * ​suffix (`String`): Suffix to use for the directory name. * ​prefix (`String`): Prefix to use for the directory name. * ​dir (`Optional[String]`): Directory in which the directory will be created. * ​ignore\_cleanup\_errors (`Bool`): Whether to ignore cleanup errors. ### `__enter__` `__enter__(self) -> String` The function to call when entering the context. **Returns:** The temporary directory name. ### `__exit__` `__exit__(self)` Called when exiting the context with no error. `__exit__(self, err: Error) -> Bool` Called when exiting the context with an error. **Args:** * ​err (`Error`): The error raised inside the context. **Returns:** True if the temporary directory was removed successfully. --- ## tensor APIs to create and manage tensors in a graph. ## Modules * [​`io_spec`](/max/api/mojo/tensor/io_spec/): * [​`managed_tensor_slice`](/max/api/mojo/tensor/managed_tensor_slice/): Implements the `ManagedTensorSlice` type - a view of a tensor that doesn't own the underlying data. This type is used to build custom graph operations. * [​`tensor_spec`](/max/api/mojo/tensor/tensor_spec/): You can import these APIs from the `max.tensor` package. For example: * [​`transitional`](/max/api/mojo/tensor/transitional/): Utilities for transitional period during NDBuffer deprecation. --- ## Tensor ```c #include "max/c/tensor.h" ``` ## Functions ### `M_newTensorSpec()` > [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*M\_newTensorSpec(const int64\_t \*shape, int64\_t rankSize, [M\_Dtype](types.md#_CPPv47M_Dtype) dtype, const char \*tensorName) Creates a tensor specification. You need this in order to set the input tensors with [`M_borrowTensorInto()`](#tensor_8h_1ab98a1def2bfd4b49ac1d3a1b77ed96b9). When storing tensor data in memory, we always use a diminishing stride size. That is, earlier dimensions in the shape have larger strides than later dimensions. For example, a C array declared as `int arr[1][2][3]` would have a shape specified as `{1, 2, 3}`. * **Parameters:** * **shape** – The shape of the tensor. * **rankSize** – The rank size of the tensor. 
  * **dtype** – The datatype for the tensor.
  * **tensorName** – The name for the tensor. This string gets copied as part of the operation of `M_newTensorSpec`, so your original string need not remain valid after the completion of this call.
* **Returns:** A pointer to the tensor spec. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTensorSpec()`](#tensor_8h_1af0b957daeba1760134c3f24079b53026).

### `M_isDynamicRanked()`

> int M\_isDynamicRanked(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)

Returns whether the given spec has a dynamic rank.

* **Parameters:** **spec** – The tensor spec.
* **Returns:** `1` if the rank is dynamic. `0` otherwise.

### `M_getDimAt()`

> int64\_t M\_getDimAt(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec, size\_t axis)

Gets the dimension at a particular axis.

* **Parameters:**
  * **spec** – The tensor spec.
  * **axis** – The requested axis.
* **Returns:** The dimension at the requested axis if the spec and axis are valid and the spec has a static rank. Otherwise, `0`. A dimension equaling [`M_getDynamicDimensionValue()`](common.md#common_8h_1ad250f12f9b0d259172899cc8c1076760) indicates a dynamic dimension, e.g. the batch size of a model expecting a batched tensor.

### `M_getRank()`

> int64\_t M\_getRank(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)

Gets the rank from the tensor spec.

* **Parameters:** **spec** – The tensor spec.
* **Returns:** The number of dimensions in the tensor spec if the spec is static and valid, [`M_getDynamicRankValue()`](common.md#common_8h_1a3d88fdacf1960a0bcab4fc9e6768701d) if dynamic. Otherwise, `0`.

### `M_getDtype()`

> [M\_Dtype](types.md#_CPPv47M_Dtype) M\_getDtype(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)

Gets the datatype from the tensor spec.

* **Parameters:** **spec** – The tensor spec.
* **Returns:** The element type from the tensor spec if the tensor spec is valid. Otherwise, `M_UNKNOWN`.

### `M_getName()`

> const char \*M\_getName([M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)

Gets the name of the tensor from the tensor spec.

* **Parameters:** **spec** – The tensor spec.
* **Returns:** A null-terminated string containing the name of the tensor if the `spec` is valid. Otherwise, `NULL`. The memory associated with the returned string is owned by `spec`.

### `M_newAsyncTensorMap()`

> [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*M\_newAsyncTensorMap(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context)

Creates a map of tensor names to async tensors.

* **Parameters:** **context** – The runtime context.
* **Returns:** A pointer to the tensor map. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeAsyncTensorMap()`](#tensor_8h_1a0ac9628dcba39c9977b7f7ff95d8781e).

### `M_copyAsyncTensorMap()`

> [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*M\_copyAsyncTensorMap(const [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap)

Copies a tensor map.

* **Parameters:** **tensorMap** – The tensor map to copy.
* **Returns:** A pointer to the tensor map. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeAsyncTensorMap()`](#tensor_8h_1a0ac9628dcba39c9977b7f7ff95d8781e).
### `M_getTensorMapSize()` > size\_t M\_getTensorMapSize(const [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets the size of the tensor map. * **Parameters:** * **tensorMap** – The tensor map. * **status** – The status object for reporting errors. * **Returns:** The size of the tensor map if the tensor map is valid. Otherwise, `0` and the `status` parameter contains an error message. ### `M_borrowTensorInto()` > void M\_borrowTensorInto([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensors, const void \*input, const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*tensorSpec, [M\_Status](types.md#_CPPv48M_Status) \*status) Adds a tensor to the tensor map. You are responsible for the lifetime of the input tensor data. Its data gets “borrowed” into the Tensor Map. * **Parameters:** * **tensors** – The tensor map, from [`M_newAsyncTensorMap()`](#tensor_8h_1a18039c6e6c1769b947120b27178306eb). * **input** – The input tensor data. * **tensorSpec** – The tensor spec, from [`M_newTensorSpec()`](#tensor_8h_1a964a8ab740605dbc51321702c34caeef). This gets copied as part of the operation of `M_borrowTensorInto`, so your original tensorSpec need not exist through the lifetime of the tensor map. * **status** – The status object for reporting errors. ### `M_createBorrowedTensor()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createBorrowedTensor(const void \*data, const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*tensorSpec, [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates a borrowed tensor wrapped in an `AsyncValue`. * **Parameters:** * **data** – The tensor data. * **tensorSpec** – The tensor spec, from [`M_newTensorSpec()`](#tensor_8h_1a964a8ab740605dbc51321702c34caeef). This gets copied as part of the operation of [`M_createBorrowedTensor()`](#tensor_8h_1a3178be3c58f89669aeb362433c7713d9), so your original tensorSpec need not exist through the lifetime of the tensor. * **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](value.md#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`, however the tensor data is borrowed and must outlive the returned `M_AsyncValue`. ### `M_getTensorByNameFrom()` > [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*M\_getTensorByNameFrom([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap, const char \*name, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets a tensor from the tensor map by name. * **Parameters:** * **tensorMap** – The tensor map. * **name** – The name of the tensor. * **status** – The status object for reporting errors. * **Returns:** A pointer to the tensor. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTensor()`](#tensor_8h_1a339008df4a10af5e8c01ae970598765c). The held tensor inside the return value is simply borrowed from the corresponding input `M_AsyncTensorMap`. If the tensor map or name are invalid, a `NULL` pointer is returned and the `status` parameter contains an error message. 
### `M_tensorMapKeys()` > const char \*\*M\_tensorMapKeys([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap, int64\_t \*size) ### `M_deleteTensorMapKeys()` > void M\_deleteTensorMapKeys(const char \*\*keys) ### `M_getTensorFromValue()` > [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*M\_getTensorFromValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Gets a tensor from the async value. * **Parameters:** **value** – The async value. * **Returns:** A pointer to the tensor. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTensor()`](#tensor_8h_1a339008df4a10af5e8c01ae970598765c). The held tensor inside the return value is simply borrowed from the `M_AsyncValue`. Note that the tensor name is not available through this method (unlike `M_getTensorByNameFrom`). If the value is invalid or not a tensor, a `NULL` pointer is returned. ### `M_getTensorNumElements()` > size\_t M\_getTensorNumElements(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor) Gets the number of elements for the tensor. * **Parameters:** **tensor** – The tensor which must not be `NULL`. * **Returns:** The number of elements for the given tensor. ### `M_getTensorType()` > [M\_Dtype](types.md#_CPPv47M_Dtype) M\_getTensorType(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor) Gets the corresponding `M_Dtype` for the tensor. * **Parameters:** **tensor** – The tensor which must not be `NULL`. * **Returns:** The corresponding `M_Dtype` for the tensor. ### `M_getTensorData()` > const void \*M\_getTensorData(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor) Gets a pointer to underlying data of the tensor. * **Parameters:** **tensor** – The tensor which must not be `NULL`. * **Returns:** A pointer to the underlying data of the tensor. This pointer is valid for the lifetime of the underlying tensor. ### `M_getTensorSpec()` > [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*M\_getTensorSpec(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor) Gets a Tensor Spec for the tensor. * **Parameters:** **tensor** – The tensor. * **Returns:** The tensor spec for the tensor if the tensor is valid. Otherwise, `NULL`. ### `M_getTensorMapIterator()` > [M\_TensorMapIterator](types.md#_CPPv419M_TensorMapIterator) \*M\_getTensorMapIterator([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets a tensor map iterator for the tensor map. * **Parameters:** * **tensorMap** – The tensor map. * **status** – The status object for reporting errors. * **Returns:** A pointer to the tensor map iterator. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTensorMapIterator()`](#tensor_8h_1a19fe7668b091cfa8c7e52d53612445ff). If the tensor map is invalid, a `NULL` pointer is returned and the `status` parameter contains an error message. ### `M_advanceTensorMapIterator()` > void M\_advanceTensorMapIterator([M\_TensorMapIterator](types.md#_CPPv419M_TensorMapIterator) \*iterator) Advances the tensor map iterator by one entry. * **Parameters:** **iterator** – The tensor map iterator. ### `M_getNameFromMapIterator()` > const char \*M\_getNameFromMapIterator([M\_TensorMapIterator](types.md#_CPPv419M_TensorMapIterator) \*iterator) Gets the name of the tensor from the tensor map iterator. * **Parameters:** **iterator** – The tensor map iterator. 
* **Returns:** A null-terminated string containing the name of the tensor if the `iterator` is valid. Otherwise, `NULL`. The memory associated with the returned string is owned by the `iterator`.

### `M_getTensorFromMapIterator()`

> [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*M\_getTensorFromMapIterator([M\_TensorMapIterator](types.md#_CPPv419M_TensorMapIterator) \*iterator)

Gets the tensor from the tensor map iterator.

* **Parameters:** **iterator** – The tensor map iterator.
* **Returns:** A pointer to the tensor. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTensor()`](#tensor_8h_1a339008df4a10af5e8c01ae970598765c). The held tensor inside the return value is simply borrowed from the corresponding input `M_AsyncTensorMap`. If the tensor map iterator is invalid, a `NULL` pointer is returned.

### `M_isEndOfTensorMap()`

> bool M\_isEndOfTensorMap([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap, [M\_TensorMapIterator](types.md#_CPPv419M_TensorMapIterator) \*iterator)

Checks if the iterator has reached the end of the tensor map.

* **Parameters:**
  * **tensorMap** – The tensor map.
  * **iterator** – The tensor map iterator.
* **Returns:** True if the iterator points to the end of the map, false otherwise. Also returns true if either the tensorMap or iterator are invalid.

### `M_freeTensor()`

> void M\_freeTensor([M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor)

Deallocates the memory for the tensor. No-op if `tensor` is NULL.

* **Parameters:** **tensor** – The tensor to deallocate.

### `M_freeTensorNameArray()`

> void M\_freeTensorNameArray([M\_TensorNameArray](types.md#_CPPv417M_TensorNameArray) \*names)

Deallocates the memory for the array of tensor names. No-op if `names` is `NULL`.

* **Parameters:** **names** – The tensor names to deallocate.

### `M_freeTensorSpec()`

> void M\_freeTensorSpec([M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)

Deallocates the memory for the tensor spec. No-op if `spec` is `NULL`.

* **Parameters:** **spec** – The tensor spec to deallocate.

### `M_freeAsyncTensorMap()`

> void M\_freeAsyncTensorMap([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap)

Deallocates the memory for the tensor map. No-op if `tensorMap` is `NULL`.

* **Parameters:** **tensorMap** – The tensor map to deallocate.

### `M_freeTensorMapIterator()`

> void M\_freeTensorMapIterator([M\_TensorMapIterator](types.md#_CPPv419M_TensorMapIterator) \*iterator)

Deallocates the memory for the tensor map iterator. No-op if `iterator` is `NULL`.

* **Parameters:** **iterator** – The tensor map iterator to deallocate.

---

## tensor_builder

Tensor Builder Module

Provides a fluent interface for constructing tensors with various layouts and memory configurations. It includes utilities for creating both static (compile-time) and dynamic (runtime) tensor dimensions, supporting row-major, column-major, and custom layouts. The module enables memory placement in different address spaces (generic, shared, local) and supports features like circular indexing.

Key components:

* `ValueOrUnknown`: Represents static or dynamic tensor dimensions
* `LayoutTensorBuild`: Builder class for tensor construction
* Helper functions for dimension specification and layout creation

## Structs

* [​`LayoutTensorBuild`](./LayoutTensorBuild): Tensor layout builder providing a fluent interface for constructing tensors with various layouts.
* [​`ValueOrUnknown`](./ValueOrUnknown): Represents either a static dimension (known at compile time) or a dynamic dimension (known at runtime). ## Functions * [​`dynamic`](./dynamic): Creates a dynamic dimension with runtime value. * [​`static`](./static): Creates a static dimension with compile-time value. --- ## tensor_core Tensor Core Module for High-Performance Matrix Operations Provides abstractions for using GPU Tensor Cores to perform optimized matrix operations. It supports both NVIDIA and AMD GPU architectures with hardware-specific optimizations. ## Key Components: * `TensorCore`: Core struct that encapsulates tensor core operations with support for various data types and matrix shapes. It handles loading matrix fragments, performing matrix multiply-accumulate operations, and storing results. * Matrix Fragment Management: Functions for loading and storing matrix fragments to/from shared memory with hardware-specific optimizations. * Matrix Multiply-Accumulate (MMA): Optimized implementations of matrix multiplication operations using tensor cores. ## Supported Operations: * Matrix loading with various layouts and swizzling patterns * Matrix multiply-accumulate (D = A \* B + C) * Matrix storing with hardware-specific optimizations ## Supported Data Types: * NVIDIA: float32, bfloat16, float16, float8\_e4m3fn, float8\_e5m2 * AMD: float32, bfloat16, float16 ## Supported Matrix Shapes: * NVIDIA: 16×8×8, 16×8×4, 16×8×16, 8×8×4, 16×8×32 * AMD: 16×16×4, 16×16×16, 32×32×8 ## Aliases ### `shape_16x16x16` `alias shape_16x16x16 = IndexList(16, 16, 16, Tuple())` ### `shape_16x16x4` `alias shape_16x16x4 = IndexList(16, 16, 4, Tuple())` ### `shape_16x8x16` `alias shape_16x8x16 = IndexList(16, 8, 16, Tuple())` ### `shape_16x8x32` `alias shape_16x8x32 = IndexList(16, 8, 32, Tuple())` ### `shape_16x8x4` `alias shape_16x8x4 = IndexList(16, 8, 4, Tuple())` ### `shape_16x8x8` `alias shape_16x8x8 = IndexList(16, 8, 8, Tuple())` ### `shape_32x32x8` `alias shape_32x32x8 = IndexList(32, 32, 8, Tuple())` ### `shape_8x8x4` `alias shape_8x8x4 = IndexList(8, 8, 4, Tuple())` ### `shape_null` `alias shape_null = IndexList(0, 0, 0, Tuple())` ## Structs * [​`TensorCore`](./TensorCore): TensorCore provides an abstraction for GPU tensor core hardware to perform optimized matrix operations. ## Functions * [​`get_fragment_size`](./get_fragment_size): Calculates the fragment size per thread for a given MMA shape. * [​`get_mma_shape`](./get_mma_shape): Returns the appropriate matrix multiply-accumulate (MMA) shape for tensor core operations. * [​`num_matrix_reg`](./num_matrix_reg): Calculates the number of matrix registers required per thread. --- ## tensor_core_async Tensor Core Async Module This module provides high-performance abstractions for utilizing NVIDIA's Tensor Cores to perform asynchronous matrix multiplication operations. It implements optimized memory layouts and access patterns for efficient tensor core computations. Key components: * Layout creation functions for K-major and MN-major memory arrangements * Swizzling support for improved memory access patterns * WGMMA (Warp Group Matrix Multiply-Accumulate) descriptor generation * TensorCoreAsync struct with methods for asynchronous matrix multiplication The module supports various data types, matrix dimensions, and memory configurations, enabling efficient implementation of deep learning primitives and other tensor operations that can leverage hardware acceleration. 
Performance features: * Asynchronous execution model to overlap computation and memory access * Support for different swizzling modes to optimize memory bandwidth * Efficient register and shared memory utilization * Support for multi-warp group execution This implementation is specifically optimized for NVIDIA GPUs with Tensor Core support. ## Aliases ### `WGMMA_K_BYTES` `alias WGMMA_K_BYTES = 32` ## Structs * [​`TensorCoreAsync`](./TensorCoreAsync): High-performance asynchronous tensor core operations for matrix multiplication. ## Functions * [​`select_k_atom`](./select_k_atom): Creates a core matrix layout for tensor core operations. * [​`st_matrix_n_atom`](./st_matrix_n_atom): Creates a layout for N-major `st_matrix` atom in the context of WGMMA C matrix. * [​`st_matrix_n_layout`](./st_matrix_n_layout): Creates a layout for N-major `st_matrix` in the context of WGMMA C matrix. * [​`tile_layout_k_major`](./tile_layout_k_major): Creates a K-major layout for tensor core operations. * [​`tile_layout_mn_major`](./tile_layout_mn_major): Creates an MN-major layout for tensor core operations. * [​`tile_to_descriptor`](./tile_to_descriptor): Transforms a layout into a WGMMA descriptor-compatible layout. * [​`wgmma_c_layout`](./wgmma_c_layout): Generates three layouts for mapping WGMMA C matrix coordinates. * [​`wgmma_c_thread_layout`](./wgmma_c_thread_layout): Returns the thread layout component for WGMMA C matrix. * [​`wgmma_output_layout`](./wgmma_output_layout): Returns the output layout component for WGMMA C matrix. --- ## tensor_ops This module provides tensor core operations and utilities for GPU computation. The module includes functions for: * Tensor core based reductions (tc\_reduce) supporting various data types and SIMD widths * GEVM (General Matrix-Vector Multiplication) reductions using tensor cores * Efficient warp-level reductions leveraging tensor core operations The tensor core operations are optimized for NVIDIA GPUs and support different data types including float32, float16, and bfloat16. The module provides both scalar and vector variants of reduction operations with different SIMD widths for maximum performance. Key functions: * tc\_reduce: Main tensor core reduction function supporting various types and widths * tc\_reduce\_gevm\_8x: 8x GEVM reduction using tensor cores * tc\_reduce\_gevm\_4x: 4x GEVM reduction using tensor cores Note: Most operations require NVIDIA GPUs with tensor core support. Operations are optimized for warp-level execution. ## Functions * [​`tc_reduce`](/mojo/stdlib/gpu/tensor_ops/tc_reduce): Performs tensor core based reduction on a SIMD vector. * [​`tc_reduce_gevm_4x`](/mojo/stdlib/gpu/tensor_ops/tc_reduce_gevm_4x): Performs a 4x GEVM reduction using tensor cores. * [​`tc_reduce_gevm_8x`](/mojo/stdlib/gpu/tensor_ops/tc_reduce_gevm_8x): Performs an 8x GEVM reduction using tensor cores. --- ## tensor_spec You can import these APIs from the `max.tensor` package. For example: ```mojo from max.tensor import RuntimeTensorSpec ``` ## Structs * [​`RuntimeTensorSpec`](/max/api/mojo/tensor/tensor_spec/RuntimeTensorSpec): --- ## TensorCore `struct TensorCore[out_type: DType, in_type: DType, shape: IndexList[3], transpose_b: Bool = False]` TensorCore provides an abstraction for GPU tensor core hardware to perform optimized matrix operations. This struct encapsulates the functionality required to efficiently map matrix operations to Tensor Cores on NVIDIA and AMD GPUs. 
It handles loading matrix fragments, performing matrix multiply-accumulate operations, and storing results with hardware-specific optimizations. Note: Different shapes and data types are supported depending on the GPU hardware. For NVIDIA GPUs: * float32: 16×8×8 or 16×8×4 * half-precision: 16×8×16 * float8: 16×8×32 For AMD GPUs: * float32: 16×16×4 * half-precision: 16×16×16 or 32×32×8 ## Parameters * ​out\_type (`DType`): The data type for output/accumulation operations. * ​in\_type (`DType`): The data type for input matrix elements. * ​shape (`IndexList[3]`): The shape parameters for the matrix operation in the form \[M, N, K] where M×N is the output shape and K is the inner dimension. * ​transpose\_b (`Bool`): Whether to transpose the B matrix before multiplication. Defaults to False. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `a_reg_type` `alias a_reg_type = SIMD[in_type, num_matrix_reg[::Int,::Int]()]` ### `b_reg_type` `alias b_reg_type = SIMD[in_type, num_matrix_reg[::Int,::Int]()]` ### `c_reg_tile_type` `alias c_reg_tile_type = LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` ### `c_reg_type` `alias c_reg_type = SIMD[out_type, num_matrix_reg[::Int,::Int]()]` ### `supported_fp32` `alias supported_fp32 = (shape == IndexList(16, 8, 8, Tuple())) if is_nvidia_gpu() else (shape == IndexList(16, 16, 4, Tuple())) if (in_type is float32) else (in_type is float32)` ### `supported_fp8` `alias supported_fp8 = (shape == IndexList(16, 8, 32, Tuple())) if Tuple(VariadicPack(float8_e4m3fn, float8_e5m2)).__contains__[::EqualityComparable & ::Copyable & ::Movable](in_type) else Tuple(VariadicPack(float8_e4m3fn, float8_e5m2)).__contains__[::EqualityComparable & ::Copyable & ::Movable](in_type)` ### `supported_half` `alias supported_half = (shape == IndexList(16, 8, 16, Tuple())) if is_nvidia_gpu() else Tuple(VariadicPack(IndexList(16, 16, 16, Tuple()), IndexList(32, 32, 8, Tuple()))).__contains__[::EqualityComparable & ::Copyable & ::Movable](shape) if in_type.is_half_float() else in_type.is_half_float()` ## Methods ### `__init__` `__init__(out self)` Initialize a new TensorCore instance. ### `get_shapes` `static get_shapes[out_type: DType, in_type: DType]() -> List[IndexList[3]]` Get supported shapes for given data types. Returns a list of valid shapes for the specified output and input data types. Note: The returned shapes are hardware-dependent. Different shapes are supported for different combinations of input and output types. **Parameters:** * ​out\_type (`DType`): The output/accumulation data type. * ​in\_type (`DType`): The input matrix data type. **Returns:** List\[IndexList\[3]]: Valid shapes for the matrix operations given the specified types. ### `load_a` `load_a[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, a: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[in_type, _get_a_reg_tile_layout[::Layout,::IndexList[::Int(), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the A matrix fragments. Loads matrix A from memory into a LayoutTensor suitable for tensor core operations. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzle pattern for optimal memory access (AMD only). 
**Args:** * ​a (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix A data. **Returns:** The loaded matrix fragments as a `LayoutTensor`. `load_a[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0))` Load A matrix fragments from shared memory. Optimized version for loading A matrix fragments from shared memory. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional memory access pattern to optimize memory bandwidth. **Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source data in shared memory. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor for fragments. * ​mma\_tile\_coord\_k (`UInt`): The K coordinate of the MMA tile. Defaults to 0. ### `load_b` `load_b[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, b: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[in_type, _get_b_reg_tile_layout[::Layout,::IndexList[::Int(), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the B matrix fragments. Loads matrix B from memory into a `LayoutTensor` suitable for tensor core operations. The function handles different hardware architectures and memory access patterns. Note: If transpose\_b is `True`, the B matrix will be transposed during loading. This is more efficient than transposing the matrix in memory. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional swizzle pattern for optimal memory access (AMD only). Will cause an error if used with NVIDIA GPUs. **Args:** * ​b (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix B data. **Returns:** The loaded matrix fragments as a `LayoutTensor`.
`load_b[swizzle: OptionalReg[Swizzle] = OptionalReg[Swizzle]({:i1 0, 1})](self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0), warp_tile_coord_n: UInt = UInt(0))` Load B matrix fragments from shared memory into registers for tensor core operations. This function loads matrix B fragments from a warp tile in shared memory into register fragments for use in tensor core matrix multiply operations. It handles hardware-specific optimizations for both NVIDIA and AMD GPUs. Note: The `warp_tile` must be in shared memory. For NVIDIA GPUs, `swizzle` must be `None`. For AMD GPUs, providing an appropriate `swizzle` pattern can improve performance. **Parameters:** * ​swizzle (`OptionalReg[Swizzle]`): Optional memory access pattern for AMD GPUs to optimize memory bandwidth. Must be `None` on NVIDIA GPUs, where a swizzle pattern is always applied. **Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source `LayoutTensor` in shared memory containing the B matrix data. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination `LayoutTensor` to store the loaded matrix fragments. * ​mma\_tile\_coord\_k (`UInt`): K-dimension coordinate within the warp tile. Defaults to 0. * ​warp\_tile\_coord\_n (`UInt`): N-dimension coordinate within the warp tile. Defaults to 0. `load_b(self, warp_tile: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], fragments: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], scales: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], mma_tile_coord_k: UInt = UInt(0))` Load quantized B matrix fragments from shared memory with dequantization. This function loads int4 quantized matrix B fragments from shared memory, dequantizes them using the provided scales, and stores the result in register fragments for tensor core operations. Notes: * The `warp_tile` must be in shared memory. * The `fragments` and `scales` must be in local memory. * This function only supports half-precision data types (bfloat16, float16). * The quantized data is stored as int4 values packed into int32 elements. * Each thread processes multiple fragments by unpacking and dequantizing the int4 values.
**Args:** * ​warp\_tile (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Source `LayoutTensor` in shared memory containing the quantized B matrix data. * ​fragments (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Destination `LayoutTensor` to store the dequantized matrix fragments. * ​scales (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): `LayoutTensor` containing the scaling factors for dequantization. * ​mma\_tile\_coord\_k (`UInt`): K-dimension coordinate within the warp tile. Defaults to 0. ### `load_c` `load_c(self, c: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` Load the C matrix fragments. Loads matrix C from memory into a `LayoutTensor` suitable for tensor core operations. The function handles different hardware architectures and memory access patterns. **Args:** * ​c (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source matrix C data. **Returns:** The loaded matrix fragments as a `LayoutTensor`. ### `store_d` `store_d(self, d_dst: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], d_src: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Store matrix D to destination memory. Stores the result matrix D from tensor core computation to the destination memory. **Args:** * ​d\_dst (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor to store the result. * ​d\_src (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor containing the computed result. 
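Taken together with the `mma_op()` method documented next, the load and store methods above form a load-compute-store pipeline that one warp executes cooperatively. The following is a minimal sketch of that flow, assuming `TensorCore` is imported from `layout.tensor_core` and that the tiles are already staged appropriately for the hardware; the function name and the row-major tile layouts are illustrative, not part of the API:

```mojo
from layout import Layout, LayoutTensor
from layout.tensor_core import TensorCore
from utils.index import Index

fn mma_16x8x8_tile(
    d_tile: LayoutTensor[DType.float32, Layout.row_major(16, 8), MutableAnyOrigin],
    a_tile: LayoutTensor[DType.float32, Layout.row_major(16, 8), MutableAnyOrigin],
    b_tile: LayoutTensor[DType.float32, Layout.row_major(8, 8), MutableAnyOrigin],
    c_tile: LayoutTensor[DType.float32, Layout.row_major(16, 8), MutableAnyOrigin],
):
    # One warp cooperatively computes D = A * B + C on a single
    # 16x8x8 tile (an NVIDIA float32 MMA shape from the table above).
    var mma = TensorCore[DType.float32, DType.float32, Index(16, 8, 8)]()
    var a_reg = mma.load_a(a_tile)
    var b_reg = mma.load_b(b_tile)
    var c_reg = mma.load_c(c_tile)
    var d_reg = mma.mma_op(a_reg, b_reg, c_reg)
    mma.store_d(d_tile, d_reg)
```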
### `mma_op` `mma_op(self, a: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> LayoutTensor[out_type, col_major(1, num_matrix_reg[::Int,::Int]()), MutableAnyOrigin, address_space=AddressSpace(5)]` Perform matrix multiply-accumulate operation (MMA). Executes `D = A * B + C` using tensor cores. **Args:** * ​a (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The A matrix input. * ​b (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The B matrix input. * ​c (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The C matrix input for accumulation. **Returns:** `Self.c_reg_tile_type`: The result of the MMA operation. ### `mma` `mma(self, a_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_frag: LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Perform matrix multiply-accumulate operation using tensor cores. Executes C = A \* B + C using tensor cores, where A, B, and C are matrix fragments stored in register memory. This function handles the mapping of fragments to hardware tensor core operations. Notes: * All fragments must be properly loaded using the corresponding load functions. * The function assumes fragments are vectorized layout tensors with dimensions num\_vectors x 1. * The c\_frag shape\[0] must equal num\_m\_mmas \* num\_n\_mmas. * The result is accumulated in-place in c\_frag. **Args:** * ​a\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A fragments as a `LayoutTensor`. * ​b\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B fragments as a `LayoutTensor`. 
* ​c\_frag (`LayoutTensor[dtype, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix C fragments as a `LayoutTensor` for both input and output. --- ## TensorCoreAsync `struct TensorCoreAsync[c_type: DType, a_type: DType, b_type: DType, mma_shape: IndexList[3], /, a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = False]` High-performance asynchronous tensor core operations for matrix multiplication. This struct provides methods for utilizing NVIDIA's Tensor Cores for asynchronous matrix multiplication operations, with support for various data types and swizzling configurations. ## Parameters * ​c\_type (`DType`): Data type of the output matrix C. * ​a\_type (`DType`): Data type of the input matrix A. * ​b\_type (`DType`): Data type of the input matrix B. * ​mma\_shape (`IndexList[3]`): Dimensions for the matrix multiply-accumulate (MMA) operation as \[M, N, K]. * ​a\_swizzle (`TensorMapSwizzle`): Swizzling mode for matrix A (default: SWIZZLE\_NONE). * ​b\_swizzle (`TensorMapSwizzle`): Swizzling mode for matrix B (default: SWIZZLE\_NONE). * ​transpose\_b (`Bool`): Whether to transpose matrix B (default: False). ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initialize the `TensorCoreAsync` instance. Ensures that the provided MMA shape is supported. Note: Fails to compile if `mma_shape` is not supported. ### `wgmma` `static wgmma[num_warp_groups: Int = 1, scale_c: Int = 1, scale_a: Int = 1, scale_b: Int = 1](a_smem_tile: LayoutTensor[a_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_smem_tile: LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], wg_idx: Int = 0)` Perform asynchronous matrix multiplication using warp group matrix multiply-accumulate (WGMMA). This method handles the case where both A and B matrices are in shared memory. **Parameters:** * ​num\_warp\_groups (`Int`): Number of warp groups to distribute work across (default: 1). * ​scale\_c (`Int`): Scale factor for matrix C. Valid values are 1 or 0 (default: 1). * ​scale\_a (`Int`): Scale factor for matrix A. Valid values are 1 or -1 (default: 1). * ​scale\_b (`Int`): Scale factor for matrix B. Valid values are 1 or -1 (default: 1). **Args:** * ​a\_smem\_tile (`LayoutTensor[a_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A in shared memory. * ​b\_smem\_tile (`LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B in shared memory. 
* ​c\_reg\_tile (`LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Output matrix C in register memory. * ​wg\_idx (`Int`): Warp group index for multi-warp group scenarios (default: 0). `static wgmma(a_frag_tile: LayoutTensor[a_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], b_smem_tile: LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_reg_tile: LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Perform asynchronous matrix multiplication using warp group matrix multiply-accumulate (WGMMA). This overloaded method handles the case where matrix A is in register memory and matrix B is in shared memory. **Args:** * ​a\_frag\_tile (`LayoutTensor[a_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix A in register memory. * ​b\_smem\_tile (`LayoutTensor[b_type, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Matrix B in shared memory. * ​c\_reg\_tile (`LayoutTensor[c_type, layout, origin, address_space=AddressSpace(5), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): Output matrix C in register memory. ### `arrive` `static arrive()` Ensures memory consistency by creating a fence for WGMMA operations. This method should be called before committing a group to ensure all shared memory accesses are properly aligned and visible. ### `commit_group` `static commit_group()` Commits the current warp group for execution. This synchronizes the warp group and commits all pending WGMMA operations that have been previously issued. ### `wait_group` `static wait_group[group: Int = 0]()` Waits for the completion of a specific warp group's operations. This method blocks until all WGMMA operations from the specified group are complete. **Parameters:** * ​group (`Int`): The group ID to wait for (default: 0). --- ## TensorMemory `@register_passable(trivial)` `struct TensorMemory` A wrapper around tensor memory allocated for tcgen05 instructions. ## Fields * ​ptr (`UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3), alignment=16]`): Pointer to the tensor memory address. * ​num\_cols (`SIMD[uint32, 1]`): The number of columns in the tensor memory. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(num_cols: SIMD[uint32, 1]) -> Self` Initialize the TensorMemory struct. **Args:** * ​num\_cols (`SIMD[uint32, 1]`): The number of columns to allocate. --- ## TensorValue Library for the graph TensorValue class. 
## `TensorValue` {#max.graph.TensorValue} > *class* max.graph.TensorValue(value) Bases: [`Value`](Value.md#max.graph.Value)\[`TensorType`] Represents a value-semantic tensor within a [`Graph`](Graph.md#max.graph.Graph). It provides various methods and properties to manipulate and query tensor attributes such as [`shape`](#max.graph.TensorValue.shape), data type ([`dtype`](#max.graph.TensorValue.dtype)), device placement ([`device`](#max.graph.TensorValue.device)), and more. The following example demonstrates how to create and manipulate tensor values in a graph:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("tensor_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Access tensor properties
    print(f"Shape: {tensor.shape}")  # Output: [2, 2]
    print(f"Data type: {tensor.dtype}")  # Output: DType.float32

    # Perform operations on the tensor
    transposed = tensor.T
    doubled = tensor * 2

    print(f"Original shape: {tensor.shape}")  # Output: [2, 2]
    print(f"Transposed shape: {transposed.shape}")  # Output: [2, 2]
```

Value is abstract; it shouldn’t be constructed directly. **Parameters:** **value** (`TensorValueLike` ) ### `T` {#max.graph.TensorValue.T} > *property* T\*: [TensorValue](#max.graph.TensorValue)\* Returns the transposed tensor. [`T`](#max.graph.TensorValue.T) is the shorthand notation for transposing. For more information, see [`transpose()`](#max.graph.TensorValue.transpose). **Returns:** A new [`TensorValue`](#max.graph.TensorValue) with swapped dimensions. ### `broadcast_to()` {#max.graph.TensorValue.broadcast_to} > broadcast\_to(shape) Broadcasts the tensor to a new shape. The following example demonstrates how to broadcast a tensor to a larger shape:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("broadcast_to_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Broadcast tensor to a 3x2x2 tensor (add a new dimension of size 3)
    broadcasted_tensor = tensor.broadcast_to((3, 2, 2))

    print(f"Original shape: {tensor.shape}")  # Output: [2, 2]
    print(f"Broadcasted shape: {broadcasted_tensor.shape}")  # Output: [3, 2, 2]
```

**Parameters:** **shape** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – An iterable of integers or symbolic dimensions. **Returns:** A new [`TensorValue`](#max.graph.TensorValue) with the broadcasted shape. **Return type:** [*TensorValue*](#max.graph.TensorValue) ### `cast()` {#max.graph.TensorValue.cast} > cast(dtype) Casts a symbolic tensor to a different data type.
The following example demonstrates how to cast a tensor from one data type to another:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a matrix with float32 values
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("cast_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Cast tensor to integer type
    casted_tensor = tensor.cast(DType.int32)

    print(f"Original dtype: {tensor.dtype}")  # Output: DType.float32
    print(f"Casted dtype: {casted_tensor.dtype}")  # Output: DType.int32
```

**Parameters:** **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The target data type (e.g., `DType.int32`, `DType.float64`). **Returns:** A new [`TensorValue`](#max.graph.TensorValue) with the casted data type. **Return type:** [*TensorValue*](#max.graph.TensorValue) ### `device` {#max.graph.TensorValue.device} > *property* device\*: DeviceRef\* Returns the device of the TensorValue. ### `dtype` {#max.graph.TensorValue.dtype} > *property* dtype\*: [DType](../dtype.md#max.dtype.DType)\* Returns the tensor data type. The following example demonstrates how to access the data type of a tensor:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a matrix with float32 values
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("dtype_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Access tensor data type
    print(f"Data type: {tensor.dtype}")  # Output: DType.float32
```

### `flatten()` {#max.graph.TensorValue.flatten} > flatten(start\_dim=0, end\_dim=-1) Flattens the specified dims of a symbolic tensor. The number and order of the elements in the tensor is unchanged. All dimensions from `start_dim` to `end_dim` (inclusive) are merged into a single output dim. The following example demonstrates how to flatten a multi-dimensional tensor:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("flatten_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Flatten the tensor to a 1D array
    flattened_tensor = tensor.flatten()

    print(f"Original shape: {tensor.shape}")  # Output: [2, 2]
    print(f"Flattened shape: {flattened_tensor.shape}")  # Output: [4]
```

**Parameters:** * **start\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The starting dimension to flatten. Defaults to `0`. * **end\_dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The ending dimension to flatten. Defaults to `-1`. **Returns:** A new [`TensorValue`](#max.graph.TensorValue) with the flattened dimensions. **Return type:** [*TensorValue*](#max.graph.TensorValue) ### `permute()` {#max.graph.TensorValue.permute} > permute(dims) Permutes the tensor’s dimensions based on provided indices. **Parameters:** **dims** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) – A list of integers specifying the new order of dimensions. **Returns:** A new [`TensorValue`](#max.graph.TensorValue) with permuted dimensions.
**Return type:** [*TensorValue*](#max.graph.TensorValue) ### `print()` {#max.graph.TensorValue.print} > print(label='debug\_tensor') Prints detailed information about the tensor. **Parameters:** **label** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – A string label for the printed output. Defaults to `debug_tensor`. ### `rank` {#max.graph.TensorValue.rank} > *property* rank\*: [int](https://docs.python.org/3/library/functions.html#int)\* Returns the rank (number of dims) of the buffer. The following example demonstrates how to access the rank of a tensor:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a 2x2 matrix (2-dimensional array)
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("rank_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Access tensor rank (number of dimensions)
    print(f"Rank: {tensor.rank}")  # Output: 2
```

### `rebind()` {#max.graph.TensorValue.rebind} > rebind(shape, message='') Rebinds the tensor to a new shape with error handling. **Parameters:** * **shape** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – The new shape as an iterable of integers or symbolic dimensions. * **message** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – (optional) A message for logging or debugging. **Returns:** A new [`TensorValue`](#max.graph.TensorValue) with the updated shape. **Return type:** [*TensorValue*](#max.graph.TensorValue) ### `reshape()` {#max.graph.TensorValue.reshape} > reshape(shape) Creates a new tensor with the same data but reshaped. The following example demonstrates how to reshape a tensor to change its dimensions:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("reshape_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Reshape tensor to a 1x4 matrix
    reshaped_tensor = tensor.reshape((1, 4))

    print(f"Original shape: {tensor.shape}")  # Output: [2, 2]
    print(f"Reshaped shape: {reshaped_tensor.shape}")  # Output: [1, 4]
```

**Parameters:** **shape** ([`Iterable`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](type.md#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) `]` ) – The new shape as an iterable of integers or symbolic dimensions. **Returns:** A new [`TensorValue`](#max.graph.TensorValue) with the reshaped dimensions. **Return type:** [*TensorValue*](#max.graph.TensorValue) ### `shape` {#max.graph.TensorValue.shape} > *property* shape\*: [Shape](type.md#max.graph.type.Shape)\* Returns the shape of the [`TensorValue`](#max.graph.TensorValue).
The following example demonstrates how to access the shape of a tensor:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("shape_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Access tensor shape
    print(f"Shape: {tensor.shape}")  # Shape: [Dim(2), Dim(2)]
```

### `to()` {#max.graph.TensorValue.to} > to(device) Transfers the tensor to a specified device without mutation. The following example demonstrates how to move a tensor from one device to another:

```python
import numpy as np
from max.dtype import DType
from max.graph import Graph, ops, DeviceRef

# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

with Graph("to_device_example") as graph:
    # Create a tensor on the default device
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Move the tensor to a GPU device
    gpu_tensor = tensor.to(DeviceRef.GPU())

    print(f"Original device: {tensor.device}")  # Output depends on default device
    print(f"New device: {gpu_tensor.device}")  # Output: gpu:0
```

**Parameters:** **device** (`DeviceRef` ) – A `DeviceRef` object specifying the target device. **Returns:** A new [`TensorValue`](#max.graph.TensorValue) on the specified device. **Return type:** [*TensorValue*](#max.graph.TensorValue) ### `transpose()` {#max.graph.TensorValue.transpose} > transpose(dim\_1, dim\_2) Swaps two dimensions of the tensor. The following example demonstrates how to transpose a tensor by swapping its dimensions:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a 2x3 matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)

with Graph("transpose_demo") as graph:
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Transpose the tensor (swap dimensions 0 and 1)
    transposed_tensor = tensor.transpose(dim_1=0, dim_2=1)

    print(f"Original shape: {tensor.shape}")  # Output: [2, 3]
    print(f"Transposed shape: {transposed_tensor.shape}")  # Output: [3, 2]
```

**Parameters:** * **dim\_1** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The first dimension to swap. * **dim\_2** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The second dimension to swap. **Returns:** A new [`TensorValue`](#max.graph.TensorValue) with swapped dimensions. **Return type:** [*TensorValue*](#max.graph.TensorValue) ### `type` {#max.graph.TensorValue.type} > *property* type\*: [TensorType](type.md#max.graph.type.TensorType)\* Returns the type of the [`TensorValue`](#max.graph.TensorValue) as a `TensorType`. --- ## terminate This module includes the exit functions. ## Functions * [​`exit`](/mojo/stdlib/sys/terminate/exit): Exits from Mojo. Unlike the Python implementation, this does not raise an exception to exit. --- ## testing Implements the testing package. ## Modules * [​`testing`](/mojo/stdlib/testing/testing/): Implements various testing utils. --- ## testing Implements various testing utils. You can import these APIs from the `testing` package.
For example:

```mojo
from testing import assert_true

def main():
    x = 1
    y = 2
    try:
        assert_true(x==1)
        assert_true(y==2)
        assert_true((x+y)==3)
        print("All assertions succeeded")
    except e:
        print("At least one assertion failed:")
        print(e)
```

## Structs * [​`assert_raises`](/mojo/stdlib/testing/testing/assert_raises): Context manager that asserts that the block raises an exception. ## Functions * [​`assert_almost_equal`](/mojo/stdlib/testing/testing/assert_almost_equal): Asserts that the input values are equal up to a tolerance. If they are not, an Error is raised. * [​`assert_equal`](/mojo/stdlib/testing/testing/assert_equal): Asserts that the input values are equal. If they are not, an Error is raised. * [​`assert_false`](/mojo/stdlib/testing/testing/assert_false): Asserts that the input value is False and raises an Error if it's not. * [​`assert_is`](/mojo/stdlib/testing/testing/assert_is): Asserts that the input values have the same identity. If they do not, an Error is raised. * [​`assert_is_not`](/mojo/stdlib/testing/testing/assert_is_not): Asserts that the input values have different identities. If they do not, an Error is raised. * [​`assert_not_equal`](/mojo/stdlib/testing/testing/assert_not_equal): Asserts that the input values are not equal. If they are equal, an Error is raised. * [​`assert_true`](/mojo/stdlib/testing/testing/assert_true): Asserts that the input value is True and raises an Error if it's not. --- ## Testing Mojo includes a framework for developing and executing unit tests. The Mojo testing framework consists of a set of assertions defined as part of the [Mojo standard library](/mojo/lib) and the [`mojo test`](/mojo/cli/test) command line tool. ## Get started Let's start with a simple example of writing and running Mojo tests. ### 1. Write tests For your first example of using the Mojo testing framework, create a file named `test_quickstart.mojo` containing the following code:

```mojo
# Content of test_quickstart.mojo
from testing import assert_equal

def inc(n: Int) -> Int:
    return n + 1

def test_inc_zero():
    # This test contains an intentional logical error to show an example of
    # what a test failure looks like at runtime.
    assert_equal(inc(0), 0)

def test_inc_one():
    assert_equal(inc(1), 2)
```

In this file, the `inc()` function is the test *target*. The functions whose names begin with `test_` are the tests. Usually the target should be in a separate source file from its tests, but you can define them in the same file for this simple example. A test function *fails* if it raises an error when executed, otherwise it *passes*. The two tests in this example use the `assert_equal()` function, which raises an error if the two values provided are not equal. :::note The implementation of `test_inc_zero()` contains an intentional logical error so that you can see an example of a failed test when you execute it in the next step of this tutorial. ::: ### 2. Execute tests
Then in the directory containing the file, execute the following command in your shell:

```bash
mojo test test_quickstart.mojo
```

You should see output similar to this (note that this example elides the full filesystem paths from the output shown):

```output
Testing Time: 1.193s

Total Discovered Tests: 2

Passed : 1 (50.00%)
Failed : 1 (50.00%)
Skipped: 0 (0.00%)

******************** Failure: 'ROOT_DIR/test_quickstart.mojo::test_inc_zero()' ********************

Unhandled exception caught during execution

Error: At ROOT_DIR/test_quickstart.mojo:8:17: AssertionError: `left == right` comparison failed:
 left: 1
 right: 0

********************
```

The output starts with a summary of the number of tests discovered, passed, failed, and skipped. Following that, each failed test is reported along with its error message. ### Next steps - [The `testing` module](#the-testing-module) describes the assertion functions available to help implement tests. - [Writing unit tests](#writing-unit-tests) shows how to write unit tests and organize them into test files. - [The `mojo test` command](#the-mojo-test-command) describes how to execute and collect lists of tests. - Our GitHub repo contains an [example project](https://github.com/modular/modular/tree/main/examples/mojo/testing) to demonstrate unit testing. Several of the examples shown later are based on this project. ## The `testing` module The Mojo standard library includes a [`testing`](/mojo/stdlib/testing/testing/) module that defines several assertion functions for implementing tests. Each assertion returns `None` if its condition is met or raises an error if it isn't. - [`assert_true()`](/mojo/stdlib/testing/testing/assert_true): Asserts that the input value is `True`. - [`assert_false()`](/mojo/stdlib/testing/testing/assert_false): Asserts that the input value is `False`. - [`assert_equal()`](/mojo/stdlib/testing/testing/assert_equal): Asserts that the input values are equal. - [`assert_not_equal()`](/mojo/stdlib/testing/testing/assert_not_equal): Asserts that the input values are not equal. - [`assert_almost_equal()`](/mojo/stdlib/testing/testing/assert_almost_equal): Asserts that the input values are equal up to a tolerance. The boolean assertions report a basic error message when they fail.

```mojo
from testing import *

assert_true(False)
```

```output
Unhandled exception caught during execution
Error: At Expression [1] wrapper:14:16: AssertionError: condition was unexpectedly False
```

Each function also accepts an optional `msg` keyword argument for providing a custom message to include if the assertion fails.

```mojo
assert_true(False, msg="paradoxes are not allowed")
```

```output
Unhandled exception caught during execution
Error: At Expression [2] wrapper:14:16: AssertionError: paradoxes are not allowed
```

For comparing floating point values you should use `assert_almost_equal()`, which allows you to specify either an absolute or relative tolerance.

```mojo
result = 10 / 3
assert_almost_equal(result, 3.33, atol=0.001, msg="close but no cigar")
```

```output
Unhandled exception caught during execution
Error: At Expression [3] wrapper:15:24: AssertionError: 3.3333333333333335 is not close to 3.3300000000000001 with a diff of 0.0033333333333334103 (close but no cigar)
```

The testing module also defines a [context manager](/mojo/manual/errors#use-a-context-manager), [`assert_raises()`](/mojo/stdlib/testing/testing/assert_raises), to assert that a given code block correctly raises an expected error.
```mojo
def inc(n: Int) -> Int:
    if n == Int.MAX:
        raise Error("inc overflow")
    return n + 1

print("Test passes because the error is raised")
with assert_raises():
    _ = inc(Int.MAX)

print("Test fails because the error isn't raised")
with assert_raises():
    _ = inc(Int.MIN)
```

```output
Unhandled exception caught during execution
Test passes because the error is raised
Test fails because the error isn't raised
Error: AssertionError: Didn't raise at Expression [4] wrapper:18:23
```

:::note The example above assigns the return value from `inc()` to a [*discard pattern*](/mojo/manual/lifecycle/death#explicit-lifetime-extension). Without it, the Mojo compiler reports a warning that the return value is unused. ::: You can also provide an optional `contains` argument to `assert_raises()` to indicate that the test passes only if the error message contains the substring specified. Other errors are propagated, failing the test.

```mojo
print("Test passes because the error contains the substring")
with assert_raises(contains="required"):
    raise Error("missing required argument")

print("Test fails because the error doesn't contain the substring")
with assert_raises(contains="required"):
    raise Error("invalid value")
```

```output
Unhandled exception caught during execution
Test passes because the error contains the substring
Test fails because the error doesn't contain the substring
Error: invalid value
```

## Writing unit tests A Mojo unit test is simply a function that fulfills all of these requirements: - Has a name that starts with `test_`. - Accepts no arguments. - Returns `None`. - Raises an error to indicate test failure. - Is defined at the module scope, not as a Mojo struct method. You can use either `def` or `fn` to define a test function. Because a test function always raises an error to indicate failure, any test function defined using `fn` must include the `raises` declaration. Generally, you should use the assertion utilities from the Mojo standard library [`testing`](/mojo/stdlib/testing/testing/) module to implement your tests. You can include multiple related assertions in the same test function. However, if an assertion raises an error during execution then the test function returns immediately, skipping any subsequent assertions. You must define your Mojo unit tests in Mojo source files named with a `test` prefix or suffix. You can organize your test files within a directory hierarchy, but the test files must not be part of a Mojo package (that is, the test directories should not contain `__init__.mojo` files). Here is an example of a test file containing three tests for functions defined in a source module named `my_target_module` (which is not shown here).

```mojo
# File: test_my_target_module.mojo
from my_target_module import convert_input, validate_input
from testing import assert_equal, assert_false, assert_raises, assert_true

def test_validate_input():
    assert_true(validate_input("good"), msg="'good' should be valid input")
    assert_false(validate_input("bad"), msg="'bad' should be invalid input")

def test_convert_input():
    assert_equal(convert_input("input1"), "output1")
    assert_equal(convert_input("input2"), "output2")

def test_convert_input_error():
    with assert_raises():
        _ = convert_input("garbage")
```

The unique identity of a unit test consists of the path of the test file and the name of the test function, separated by `::`.
So the test IDs from the example above are: - `test_my_target_module.mojo::test_validate_input()` - `test_my_target_module.mojo::test_convert_input()` - `test_my_target_module.mojo::test_convert_input_error()` ## The `mojo test` command The `mojo` command line interface includes the [`mojo test`](/mojo/cli/test) command for running tests or collecting a list of tests. ### Running tests By default, the `mojo test` command runs the tests that you specify using one of the following: - A single test ID with either an absolute or relative file path, to run only that test. - A single absolute or relative file path, to run all tests in that file. - A single absolute or relative directory path, to recurse through that directory hierarchy and run all tests found. If needed, you can use the `-I` option one or more times to append additional paths to the list of directories searched to import Mojo modules and packages. Consider the [example testing project](https://github.com/modular/modular/tree/main/examples/mojo/testing) in GitHub, which has the following directory structure:

```output
.
├── src
│   ├── example.mojo
│   └── my_math
│       ├── __init__.mojo
│       └── utils.mojo
└── test
    └── my_math
        ├── test_dec.mojo
        └── test_inc.mojo
```

From the project root directory, you can execute all of the tests in the `test` directory like this:

```bash
mojo test -I src test
```

```output
Testing Time: 3.433s

Total Discovered Tests: 4

Passed : 4 (100.00%)
Failed : 0 (0.00%)
Skipped: 0 (0.00%)
```

You can run the tests contained in only the `test_dec.mojo` file like this:

```bash
mojo test -I src test/my_math/test_dec.mojo
```

```output
Testing Time: 1.175s

Total Discovered Tests: 2

Passed : 2 (100.00%)
Failed : 0 (0.00%)
Skipped: 0 (0.00%)
```

And you can run a single test from a file by providing its fully qualified ID like this:

```bash
mojo test -I src 'test/my_math/test_dec.mojo::test_dec_valid()'
```

```output
Testing Time: 0.66s

Total Discovered Tests: 1

Passed : 1 (100.00%)
Failed : 0 (0.00%)
Skipped: 0 (0.00%)
```

### Collecting a list of tests By including the `--collect-only` or `--co` option, you can use `mojo test` to discover and print a list of tests. Consider the [example testing project](https://github.com/modular/modular/tree/main/examples/mojo/testing) directory structure shown in the [Running tests](#running-tests) section. The following command produces a list of all of the tests defined in the `test` directory hierarchy.

```bash
mojo test --co test
```

The output shows the hierarchy of directories, test files, and individual tests (note that this example elides the full filesystem paths from the output shown):

```output
```

### Producing JSON formatted output By default `mojo test` produces concise, human-readable output. Alternatively you can produce JSON formatted output more suitable for input to other tools by including the `--diagnostic-format json` option.
For example, you can run the tests in the `test_quickstart.mojo` file shown in the [Get started](#get-started) section with JSON formatted output using this command: ```bash mojo test --diagnostic-format json test_quickstart.mojo ``` The output shows the detailed results for each individual test and summary results (note that this example elides the full filesystem paths from the output shown): ```json { "children": [ { "duration_ms": 60, "error": "Unhandled exception caught during execution", "kind": "executionError", "stdErr": "", "stdOut": "Error: At ROOT_DIR/test_quickstart.mojo:8:17: AssertionError: `left == right` comparison failed:\r\n left: 1\r\n right: 0\r\n", "testID": "ROOT_DIR/test_quickstart.mojo::test_inc_zero()" }, { "duration_ms": 51, "error": "", "kind": "success", "stdErr": "", "stdOut": "", "testID": "ROOT_DIR/test_quickstart.mojo::test_inc_one()" } ], "duration_ms": 1171, "error": "", "kind": "executionError", "stdErr": "", "stdOut": "", "testID": "ROOT_DIR/test_quickstart.mojo" } ``` You can also produce JSON output for test collection as well. Consider the [example testing project](https://github.com/modular/modular/tree/main/examples/mojo/testing) directory structure shown in the [Running tests](#running-tests) section. The following command collects a list in JSON format of all of the tests defined in the `test` directory hierarchy: ```bash mojo test --diagnostic-format json --co test ``` The output will appear as follows (note that this example elides the full filesystem paths from the output shown): ```json { "children": [ { "children": [ { "id": "ROOT_DIR/test/my_math/test_dec.mojo::test_dec_valid()", "location": { "endColumn": 5, "endLine": 19, "startColumn": 5, "startLine": 19 } }, { "id": "ROOT_DIR/test/my_math/test_dec.mojo::test_dec_min()", "location": { "endColumn": 5, "endLine": 24, "startColumn": 5, "startLine": 24 } } ], "id": "ROOT_DIR/test/my_math/test_dec.mojo" }, { "children": [ { "id": "ROOT_DIR/test/my_math/test_inc.mojo::test_inc_valid()", "location": { "endColumn": 5, "endLine": 19, "startColumn": 5, "startLine": 19 } }, { "id": "ROOT_DIR/test/my_math/test_inc.mojo::test_inc_max()", "location": { "endColumn": 5, "endLine": 24, "startColumn": 5, "startLine": 24 } } ], "id": "ROOT_DIR/test/my_math/test_inc.mojo" } ], "id": "ROOT_DIR/test/my_math" } ``` --- ## Thread In GPU programming, a thread is the smallest unit of execution within a [kernel](kernel.mdx) function. Threads are grouped into [thread blocks](thread-block.mdx), which are further organized into a [grid](grid.mdx). The programmer specifies the number of thread blocks in a grid and how they are arranged across one, two, or three dimensions. Each block within the grid is assigned a unique [block index](block-index.mdx) that determines its position within the grid. Similarly, the programmer also specifies the number of threads per thread block and how they are arranged across one, two, or three dimensions. Each thread within a block is assigned a unique [thread index](thread-index.mdx) that determines its position within the block. The GPU assigns each thread block within the grid to a [streaming multiprocessor](streaming-multiprocessor.mdx) (SM) for execution. The SM groups the threads within a block into fixed-size subsets called [warps](warp.mdx), consisting of either 32 or 64 threads each depending on the particular GPU architecture. The SM's warp scheduler manages the execution of warps on the SM's cores. 
The SM allocates a set of [registers](register.mdx) for each thread to store and process values private to that thread. The registers are associated with that thread throughout its lifetime, even if the thread is not currently executing on the SM's cores (for example, if it is blocked waiting for data from memory). Each thread also has access to [local memory](memory.mdx) to store statically allocated arrays, spilled registers, and other elements of the thread's call stack. Threads within a block can share data through shared memory and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks. --- ## Thread block In GPU programming, a thread block is a subset of threads within a [grid](grid.mdx), which is the top-level organizational structure of the [threads](thread.mdx) executing a [kernel](kernel.mdx) function. As the primary building block for workload distribution, thread blocks serve multiple crucial purposes: - First, they break down the overall workload — managed by the grid — of a kernel function into smaller, more manageable portions that can be processed independently. This division allows for better resource utilization and scheduling flexibility across multiple [streaming multiprocessors](streaming-multiprocessor.mdx) (SMs) in the GPU. - Second, thread blocks provide a scope for threads to collaborate through shared memory and synchronization primitives, enabling efficient parallel algorithms and data sharing patterns. - Finally, thread blocks help with scalability by allowing the same program to run efficiently across different GPU architectures, as the hardware can automatically distribute blocks based on available resources. The programmer specifies the number of thread blocks in a grid and how they are arranged across one, two, or three dimensions. Each block within the grid is assigned a unique [block index](block-index.mdx) that determines its position within the grid. Similarly, the programmer also specifies the number of threads per thread block and how they are arranged across one, two, or three dimensions. Each thread within a block is assigned a unique [thread index](thread-index.mdx) that determines its position within the block. The GPU assigns each thread block within the grid to a streaming multiprocessor (SM) for execution. The SM groups the threads within a block into fixed-size subsets called [warps](warp.mdx), consisting of either 32 or 64 threads each depending on the particular GPU architecture. The SM's warp scheduler manages the execution of warps on the SM's cores. Threads within a block can share data through [shared memory](memory.mdx) and synchronize using built-in mechanisms, but they cannot directly communicate with threads in other blocks. --- ## Thread index In GPU programming, a thread index uniquely identifies the position of a [thread](thread.mdx) within a particular [thread block](thread-block.mdx) executing a [kernel](kernel.mdx) function on the GPU. A thread block is a subset of threads in a [grid](grid.mdx), which is the top-level organizational structure of the threads executing a kernel function. Each block within the grid is also assigned a unique block index, which identifies the block's position within the grid. The combination of block index and thread index uniquely identifies the thread's overall position within the grid, and is used to determine which part of the problem each thread should work on. 
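For instance, a kernel typically combines the two indices to derive each thread's global position. Here is a minimal Mojo sketch, assuming a one-dimensional launch configuration and using the standard `gpu.id` aliases; the kernel name and its arguments are illustrative:

```mojo
from gpu.id import block_dim, block_idx, thread_idx
from memory import UnsafePointer

fn scale_kernel(data: UnsafePointer[Float32], size: Int):
    # Overall position of this thread within the grid: its block's
    # offset (block index times threads per block) plus its thread
    # index within that block.
    var i = Int(block_idx.x * block_dim.x + thread_idx.x)

    # Guard against threads past the end of the data.
    if i < size:
        data[i] = data[i] * 2.0
```

Such a function would typically be compiled and launched through a `DeviceContext`, with the grid and block dimensions chosen at launch time.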
Because a programmer can arrange threads within a thread block across one, two, or three dimensions, a thread index is a 3-element vector of x, y, and z coordinates. For 2-dimensional arrangements, the z coordinate of all thread indices is 0, and for 1-dimensional arrangements, both the y and z coordinates of all thread indices are 0. --- ## threadfence `threadfence[scope: Scope = Scope(5)]()` Enforces ordering of memory operations across threads. Acts as a memory fence/barrier that ensures all memory operations (both loads and stores) issued before the fence are visible to other threads within the specified scope before any memory operations after the fence. Note: * Maps directly to CUDA `__threadfence()` family of functions. * Critical for synchronizing memory access in parallel algorithms. * Performance impact increases with broader scopes. **Parameters:** * ​scope (`Scope`): Memory scope level for the fence. Defaults to GPU-wide scope. Valid values are: * Scope.BLOCK: Orders memory within a thread block/CTA. * Scope.GPU: Orders memory across all threads on the GPU (default). * Scope.SYSTEM: Orders memory across the entire system. --- ## ThreadScope `@register_passable(trivial)` `struct ThreadScope` Represents the scope of thread operations in GPU programming. This struct defines the scope at which thread operations are performed, particularly for operations like tensor distribution and synchronization. It provides two main scopes: `BLOCK` and `WARP`, which correspond to different levels of thread grouping in GPU programming models. Example:

```mojo
from layout.layout_tensor import copy_dram_to_sram, ThreadScope

# Distribute tensor at block level (all threads in block participate)
copy_dram_to_sram[layout, thread_scope=ThreadScope.BLOCK](dst, src)

# Distribute tensor at warp level (only threads in same warp participate)
copy_dram_to_sram[layout, thread_scope=ThreadScope.WARP](dst, src)
```

Performance: * WARP scope operations typically have lower synchronization overhead than BLOCK scope operations. * BLOCK scope operations allow coordination across all threads in a block, which is necessary for certain algorithms. * The choice of scope can significantly impact performance and correctness of parallel algorithms. Notes: * The appropriate scope depends on the specific algorithm and hardware. * WARP scope operations may be more efficient for operations that only require coordination within a warp. * BLOCK scope operations are necessary when threads from different warps need to coordinate. * The actual size of a warp or block is hardware-dependent. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `BLOCK` `alias BLOCK = ThreadScope(0)` Represents operations at the thread block level, where all threads in a block participate. ### `WARP` `alias WARP = ThreadScope(1)` Represents operations at the warp level, where only threads within the same warp participate. ## Methods ### `__init__` `@implicit` `__init__(value: Int) -> Self` Initialize a `ThreadScope` with the given integer value. **Args:** * ​value (`Int`): An integer representing the thread scope (0 for `BLOCK`, 1 for `WARP`). ### `__eq__` `__eq__(self, other: Self) -> Bool` Compare two `ThreadScope` objects for equality. **Args:** * ​other (`Self`): Another `ThreadScope` object to compare with. **Returns:** True if the thread scopes are equal, False otherwise. ### `__ne__` `__ne__(self, other: Self) -> Bool` Compare two `ThreadScope` objects for inequality.
**Args:** * ​other (`Self`): Another `ThreadScope` object to compare with. **Returns:** True if the thread scopes are not equal, False otherwise. ### `__str__` `__str__(self) -> String` Convert the `ThreadScope` to a human-readable string representation. Aborts: If the thread scope has an invalid value. **Returns:** A string representation of the thread scope ("BLOCK" or "WARP"). ### `__int__` `__int__(self) -> Int` Convert the `ThreadScope` to an integer value. **Returns:** The integer value of the thread scope (0 for BLOCK, 1 for WARP). --- ## ThroughputMeasure `struct ThroughputMeasure` Records a throughput metric of type `BenchMetric` and its measured value. ## Fields * ​metric (`BenchMetric`): Type of throughput metric. * ​value (`Int`): Measured count of throughput metric. ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self, name: String, value: Int, reference: List[BenchMetric] = List(BenchMetric(0, __init__[__mlir_type.!kgen.string]("throughput"), __init__[__mlir_type.!kgen.string]("GElems/s")), BenchMetric(1, __init__[__mlir_type.!kgen.string]("DataMovement"), __init__[__mlir_type.!kgen.string]("GB/s")), BenchMetric(2, __init__[__mlir_type.!kgen.string]("Arithmetic"), __init__[__mlir_type.!kgen.string]("GFLOPS/s")), Tuple()))` Creates a `ThroughputMeasure` based on the metric's name. Example: For the default bench metrics `BenchMetric.DEFAULTS` the following are equivalent: \- `ThroughputMeasure(BenchMetric.fmas, 1024)` \- `ThroughputMeasure("fmas", 1024)` \- `ThroughputMeasure("fmas", 1024, BenchMetric.DEFAULTS)` **Args:** * ​name (`String`): The name of the BenchMetric in its corresponding reference. * ​value (`Int`): The measured value to assign to this metric. * ​reference (`List[BenchMetric]`): List of BenchMetrics that contains this metric. `__init__(out self, *, other: Self)` Explicitly construct a deep copy of the provided value. **Args:** * ​other (`Self`): The value to copy. ### `__str__` `__str__(self) -> String` Gets a string representation of this `ThroughputMeasure`. **Returns:** The string representation. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this ThroughputMeasure to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the `Writer` trait. **Args:** * ​writer (`W`): The object to write to. ### `compute` `compute(self, elapsed_sec: SIMD[float64, 1]) -> SIMD[float64, 1]` Computes the throughput rate for this metric per unit of time (second). **Args:** * ​elapsed\_sec (`SIMD[float64, 1]`): Elapsed time measured in seconds. **Returns:** The throughput value as a 64-bit floating point number. --- ## tile ## Functions * [​`tile`](./tile): Implements the `Tile` operator from the ONNX spec. This behaves like Numpy tile, but without broadcast. * [​`tile_shape`](./tile_shape): Compute the output shape of a `tile` operation, and assert the inputs are compatible.
--- ## tile `tile[type: DType, type_repeats: DType](input: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], repeats: LayoutTensor[type_repeats, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], output: LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment])` Implements the `Tile` operator from the ONNX spec. This behaves like Numpy tile, but without broadcast. **Parameters:** * ​type (`DType`): Type of the input and output tensors. * ​type\_repeats (`DType`): Type of the repeats tensor. **Args:** * ​input (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​repeats (`LayoutTensor[type_repeats, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): One-dimensional tensor that specifies the number of repeated copies along each of the input's dimensions. Length equals input tensor rank. * ​output (`LayoutTensor[type, layout, origin, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The output tensor. Has the same dimensions and type as input. --- ## tile `tile[: origin.set, //, workgroup_function: fn[Int](Int) capturing -> None, tile_size_list: VariadicList[Int]](offset: Int, upperbound: Int)` A generator that launches work groups in a specified list of tile sizes. A workgroup function is a function that can process a configurable consecutive "tile" of workload. E.g. `work_on[3](5)` should launch computation on items 5, 6, and 7, and should be semantically equivalent to `work_on[1](5)`, `work_on[1](6)`, `work_on[1](7)`. This generator will try to proceed with the given list of tile sizes in the listed order. E.g. `tile[func, (3,2,1)](offset, upperbound)` will try to call `func[3]` starting from offset until remaining work is less than 3 from upperbound and then try `func[2]`, and then `func[1]`, etc. **Parameters:** * ​workgroup\_function (`fn[Int](Int) capturing -> None`): Workgroup function that processes one tile of workload. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. `tile[: origin.set, //, workgroup_function: fn(Int, Int) capturing -> None](offset: Int, upperbound: Int, tile_size_list: VariadicList[Int])` A generator that launches work groups in a specified list of tile sizes. This is the version of the tile generator for the case where the workgroup function can take the tile size as a runtime value. **Parameters:** * ​workgroup\_function (`fn(Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work.
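As a minimal sketch of the static-tile-size form described above (assuming `tile` is importable from the `algorithm` package; the work function and sizes are illustrative):

```mojo
from algorithm import tile

fn main():
    var upperbound = 10

    @parameter
    fn process[tile_size: Int](offset: Int):
        # Handles the consecutive items [offset, offset + tile_size).
        print("processing", tile_size, "items at offset", offset)

    # Tries tiles of 4 first, then falls back to smaller sizes for the
    # residue: offsets 0 and 4 get size 4, offset 8 gets size 2.
    tile[process, VariadicList[Int](4, 2, 1)](0, upperbound)
```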
`tile[: origin.set, //, secondary_tile_size_list: VariadicList[Int], secondary_cleanup_tile: Int, workgroup_function: fn[Int](Int, Int) capturing -> None](offset: Int, upperbound: Int, primary_tile_size_list: VariadicList[Int], primary_cleanup_tile: Int)` A generator that launches work groups in a specified list of tile sizes until the sum of primary\_tile\_sizes has exceeded the upperbound. **Parameters:** * ​secondary\_tile\_size\_list (`VariadicList[Int]`): List of static tile sizes to launch work. * ​secondary\_cleanup\_tile (`Int`): Last static tile to use when primary tile sizes don't fit exactly within the upperbound. * ​workgroup\_function (`fn[Int](Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​primary\_tile\_size\_list (`VariadicList[Int]`): List of dynamic tile sizes to launch work. * ​primary\_cleanup\_tile (`Int`): Last dynamic tile to use when primary tile sizes don't fit exactly within the upperbound. `tile[: origin.set, //, workgroup_function: fn[Int, Int](Int, Int) capturing -> None, tile_sizes_x: VariadicList[Int], tile_sizes_y: VariadicList[Int]](offset_x: Int, offset_y: Int, upperbound_x: Int, upperbound_y: Int)` Launches workgroup\_function using the largest tile sizes possible in each dimension, starting from the x and y offset, until the x and y upperbounds are reached. **Parameters:** * ​workgroup\_function (`fn[Int, Int](Int, Int) capturing -> None`): Function that is invoked for each tile and offset. * ​tile\_sizes\_x (`VariadicList[Int]`): List of tile sizes to use for the first parameter of workgroup\_function. * ​tile\_sizes\_y (`VariadicList[Int]`): List of tile sizes to use for the second parameter of workgroup\_function. **Args:** * ​offset\_x (`Int`): Initial x offset passed to workgroup\_function. * ​offset\_y (`Int`): Initial y offset passed to workgroup\_function. * ​upperbound\_x (`Int`): Max offset in x dimension passed to workgroup function. * ​upperbound\_y (`Int`): Max offset in y dimension passed to workgroup function. --- ## tile_and_unswitch `tile_and_unswitch[: origin.set, //, workgroup_function: fn[Int, Bool](Int, Int) capturing -> None, tile_size_list: VariadicList[Int]](offset: Int, upperbound: Int)` Performs tile and unswitch functional transformation. A variant of static tile given a workgroup function that can be unswitched. This generator is a fused version of tile and unswitch, where the static unswitch is true throughout the "inner" portion of the workload and is false only on the residue tile. **Parameters:** * ​workgroup\_function (`fn[Int, Bool](Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. `tile_and_unswitch[: origin.set, //, workgroup_function: fn[Bool](Int, Int, Int) capturing -> None](offset: Int, upperbound: Int, tile_size_list: VariadicList[Int])` Performs tile and unswitch functional transformation. A variant of dynamic tile given a workgroup function that can be unswitched. This generator is a fused version of tile and unswitch, where the static unswitch is true throughout the "inner" portion of the workload and is false only on the residue tile.
**Parameters:** * ​workgroup\_function (`fn[Bool](Int, Int, Int) capturing -> None`): Workgroup function that processes one tile of workload. **Args:** * ​offset (`Int`): The initial index to start the work from. * ​upperbound (`Int`): The runtime upperbound that the work function should not exceed. * ​tile\_size\_list (`VariadicList[Int]`): List of tile sizes to launch work. --- ## tile_layout_k_major `tile_layout_k_major[type: DType, BM: Int, BK: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))]() -> Layout` Creates a K-major layout for tensor core operations. Constructs a layout optimized for K-major access patterns in tensor core operations, with optional swizzling for improved memory access patterns. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​BM (`Int`): Size of the M dimension in the tile. * ​BK (`Int`): Size of the K dimension in the tile. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode (default: SWIZZLE\_NONE). **Returns:** `Layout` - A K-major layout configured for the specified dimensions and swizzle mode. --- ## tile_layout_mn_major `tile_layout_mn_major[type: DType, mn_dim: Int, k_dim: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))]() -> Layout` Creates an MN-major layout for tensor core operations. Constructs a unit layout optimized for MN-major access patterns in shared memory, with optional swizzling for improved memory access patterns. Note: This returns the "unit" layout; the actual shared memory layout can be a multiple of this unit. Currently only supports SWIZZLE\_NONE and SWIZZLE\_128B modes. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​mn\_dim (`Int`): Size of the MN dimension. * ​k\_dim (`Int`): Size of the K dimension. * ​swizzle\_mode (`TensorMapSwizzle`): Memory access pattern swizzling mode (default: SWIZZLE\_NONE). **Returns:** `Layout` - An MN-major layout configured for the specified dimensions and swizzle mode. --- ## tile_middle_unswitch_boundaries `tile_middle_unswitch_boundaries[: origin.set, //, work_fn: fn[Int, Bool](Int) capturing -> None, middle_tile_sizes: VariadicList[Int], left_tile_size: Int = 1, right_tile_size: Int = 1](left_boundary_start: Int, left_boundary_end: Int, right_boundary_start: Int, right_boundary_end: Int)` Divides 1d iteration space into three parts and tiles them with different steps. The 1d iteration space is divided into: 1\. \[left\_boundary\_start, left\_boundary\_end), affected by the left boundary. 2\. \[left\_boundary\_end, right\_boundary\_start), not affected by any boundary. 3\. \[right\_boundary\_start, right\_boundary\_end), affected by the right boundary. work\_fn's switch is true for the left and right boundaries, implying boundary conditions like padding in convolution. The middle part is tiled with static tile sizes with the switch as false. `middle_tile_sizes` should be in descending order for optimal performance. (A larger tile size appearing later in the list fails the while-loop check.) **Parameters:** * ​work\_fn (`fn[Int, Bool](Int) capturing -> None`): Work function that processes one tile of workload. * ​middle\_tile\_sizes (`VariadicList[Int]`): List of tile sizes for the middle part. * ​left\_tile\_size (`Int`): Tile size for the left boundary region. * ​right\_tile\_size (`Int`): Tile size for the right boundary region. **Args:** * ​left\_boundary\_start (`Int`): Start index of the left boundary.
* ​left\_boundary\_end (`Int`): End index of the left boundary. * ​right\_boundary\_start (`Int`): Start index of the right boundary. * ​right\_boundary\_end (`Int`): End index of the right boundary. `tile_middle_unswitch_boundaries[: origin.set, //, work_fn: fn[Int, Bool, Bool](Int) capturing -> None, tile_size: Int, size: Int]()` Tile 1d iteration space with boundary conditions at both ends. This generator is primarily for convolution with static shapes. `work_fn`'s flags hint the function to handle padding at the boundary. The size is the static output row size, i.e., the WO dimension. **Parameters:** * ​work\_fn (`fn[Int, Bool, Bool](Int) capturing -> None`): Work function that updates one tile. It has two flags for left and right boundaries, respectively. * ​tile\_size (`Int`): 1D Tile size. * ​size (`Int`): Iteration range is \[0, size). --- ## tile_shape `tile_shape[input_type: DType, repeats_type: DType, single_thread_blocking_override: Bool](input_buf: LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], repeats_buf: LayoutTensor[repeats_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]) -> IndexList[layout.rank()]` Compute the output shape of a `tile` operation, and assert the inputs are compatible. **Parameters:** * ​input\_type (`DType`): Type of the input tensor. * ​repeats\_type (`DType`): Type of the repeats tensor. * ​single\_thread\_blocking\_override (`Bool`): If True, then the operation is run synchronously using a single thread. **Args:** * ​input\_buf (`LayoutTensor[input_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The input tensor. * ​repeats\_buf (`LayoutTensor[repeats_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The repeats tensor. **Returns:** The output shape. --- ## tile_to_descriptor `tile_to_descriptor[type: DType, layout: Layout, is_k_major: Bool = True]() -> Layout` Transforms a layout into a WGMMA descriptor-compatible layout. Converts a standard layout into a form that can be used with WGMMA descriptors, handling both K-major and MN-major layouts differently. **Parameters:** * ​type (`DType`): Element data type of the tensor. * ​layout (`Layout`): Input layout to transform. * ​is\_k\_major (`Bool`): Whether the layout is K-major (True) or MN-major (False). **Returns:** `Layout` - A transformed layout compatible with WGMMA descriptors. --- ## tile_to_shape `tile_to_shape(tile: Layout, target_shape: IntTuple[origin], order: Optional[IntTuple] = Optional(None)) -> Layout` Creates a layout by tiling a base layout to match a target shape. This function creates a hierarchical layout by repeating a tile layout to match a target shape. It calculates how many times the tile needs to be repeated in each dimension to reach the target shape, and creates a tiler layout with this information.
Example: ```mojo from layout import Layout, IntTuple from layout.layout import tile_to_shape # Create a 2x2 tile layout var tile = Layout.row_major(2, 2) # Tile it to create a 6x4 layout var tiled = tile_to_shape(tile, IntTuple(6, 4)) # Result: A layout with 3x2 tiles of size 2x2 each ``` **Args:** * ​tile (`Layout`): The base layout to be tiled. * ​target\_shape (`IntTuple[origin]`): The desired final shape to tile to. * ​order (`Optional[IntTuple]`): Optional memory ordering for the tiler layout. If None, defaults to column-major ordering. **Returns:** A new layout representing the tiled structure that matches the target shape. --- ## tileconfig `struct tileconfig` ## Fields * ​palette\_id (`SIMD[uint8, 1]`): * ​start\_row (`SIMD[uint8, 1]`): * ​reserved (`StaticTuple[scalar, 14]`): * ​colb (`StaticTuple[scalar, 16]`): * ​rows (`StaticTuple[scalar, 16]`): ## Implemented traits `AnyType`, `UnknownDestructibility` --- ## tiled_matmul_run `tiled_matmul_run[config: KernelConfig, transpose_b: Bool, b_packed: Bool, simd_size: Int, elementwise_epilogue_enabled: Bool, kernel_id: InnerKernelID, algorithm: InnerMatmulKernel](alg: algorithm, c: NDBuffer[type, 2, origin, shape], a: NDBuffer[type, 2, origin, shape], b: NDBuffer[type, 2, origin, shape], elementwise_epilogue_fn: fn(GemmShape, GemmShape) escaping -> None, global_tile_shape: GemmShape, global_tile_offset: GemmShape)` Interface function to run tiled matmul on a given sub-tile. **Args:** * ​alg (`algorithm`): InnerMatmulKernel algorithm for microkernel. * ​c (`NDBuffer[type, 2, origin, shape]`): Pre-allocated buffer space for result. * ​a (`NDBuffer[type, 2, origin, shape]`): Operand A of the matmul. * ​b (`NDBuffer[type, 2, origin, shape]`): Operand B of the matmul. * ​elementwise\_epilogue\_fn (`fn(GemmShape, GemmShape) escaping -> None`): The elementwise epilogue function. * ​global\_tile\_shape (`GemmShape`): Tile shape this call will process. * ​global\_tile\_offset (`GemmShape`): Tile offset on the original buffer. --- ## TiledMatmul `struct TiledMatmul[a_mut: Bool, b_mut: Bool, //, config: KernelConfig, transpose_b: Bool, b_packed: Bool, elementwise_epilogue_enabled: Bool, kernel_id: InnerKernelID, a_type: DType, a_shape: DimList, a_origin: Origin[a_mut], b_type: DType, b_shape: DimList, b_origin: Origin[b_mut], c_type: DType, c_shape: DimList, c_origin: MutableOrigin, algorithm: InnerMatmulKernel]` Tiled matmul implementation integrating packing, inner loop and tile partitions. TODO: add tag based implementation dispatch. TODO: add fusion hooks. ## Fields * ​alg (`algorithm`): * ​c (`NDBuffer[c_type, 2, c_origin, c_shape]`): * ​a (`NDBuffer[a_type, 2, a_origin, a_shape]`): * ​b (`NDBuffer[b_type, 2, b_origin, b_shape]`): * ​tile\_n\_k (`IndexList[2]`): * ​global\_tile\_offset (`GemmShape`): * ​global\_tile\_shape (`GemmShape`): * ​b\_tile\_generator (`BTileGenerator[config, a_type, b_type, c_type, b_shape, transpose_b, b_packed, b_origin]`): * ​elementwise\_epilogue\_fn (`fn(GemmShape, GemmShape) escaping -> None`): ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` --- ## TileMaskStatus `@register_passable(trivial)` `struct TileMaskStatus` A tile's masking status.
## Fields * ​status (`SIMD[uint8, 1]`): ## Implemented traits `AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable` ## Aliases ### `FULL_MASK` `alias FULL_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](3))` ### `NO_MASK` `alias NO_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](0))` ### `PARTIAL_MASK` `alias PARTIAL_MASK = TileMaskStatus(__init__[__mlir_type.!pop.int_literal](1))` ## Methods ### `__eq__` `__eq__(self, rhs: Self) -> Bool` ### `__ne__` `__ne__(self, rhs: Self) -> Bool` ### `__is__` `__is__(self, rhs: Self) -> Bool` ### `__and__` `__and__(self, rhs: Self) -> Self` ### `__or__` `__or__(self, rhs: Self) -> Self` ### `__is_not__` `__is_not__(self, rhs: Self) -> Bool` ### `__str__` `__str__(self) -> String` ### `write_to` `write_to[W: Writer](self, mut writer: W)` --- ## TileScheduler `@register_passable(trivial)` `struct TileScheduler[tile_shape: IndexList[3], grid_shape: IndexList[2], cluster: IndexList[3] = Index(1, 1, 1), raster_dim: SIMD[uint32, 1] = __init__[__mlir_type.!pop.int_literal](1), schedule: MatmulSchedule = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](1))]` ## Fields * ​idx (`SIMD[uint32, 1]`): * ​prob\_shape (`IndexList[3]`): * ​num\_waves\_m (`SIMD[uint32, 1]`): * ​num\_waves\_n (`SIMD[uint32, 1]`): * ​log\_num\_waves\_n (`FastDiv[uint32]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `num_grids` `alias num_grids = SIMD((grid_shape.__getitem__[::Indexer](0) * grid_shape.__getitem__[::Indexer](1)))` ### `wave_shape` `alias wave_shape = Index((grid_shape.__getitem__[::Indexer](1) * tile_shape.__getitem__[::Indexer](0)), (grid_shape.__getitem__[::Indexer](0) * tile_shape.__getitem__[::Indexer](1)))` ## Methods ### `__init__` `__init__(prob_shape: IndexList[3]) -> Self` ### `get_current_work_info` `get_current_work_info(self) -> WorkInfo` ### `advance` `advance(mut self)` ### `fetch_next_work` `fetch_next_work(mut self) -> WorkInfo` ### `num_output_tiles` `num_output_tiles(self) -> UInt` --- ## TileScheduler `@register_passable(trivial)` `struct TileScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1], /, num_ctas: SIMD[uint32, 1] = SIMD(Info(__init__[__mlir_type.!kgen.string]("H100"), Vendor(__init__[__mlir_type.!pop.int_literal](2)), __init__[__mlir_type.!kgen.string]("cuda"), __init__[__mlir_type.!kgen.string]("hopper"), __init__[__mlir_type.!kgen.string]("nvptx-short-ptr=true"), __init__[__mlir_type.!pop.float_literal](9), __init__[__mlir_type.!kgen.string]("sm_90a"), 132, 32, 2048, 32, 64, 2048, 32, 233472, 65536, 256, __init__[__mlir_type.!kgen.string]("warp"), 255, 65536, 32, 128, 4, 1024)), schedule: MHASchedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))]` ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility` ## Aliases ### `may_advance` `alias may_advance = True` ### `mha_schedule` `alias mha_schedule = schedule` ## Methods ### `__init__` `__init__() -> Self` ### `get_current_work_info` `get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo` ### `fetch_next_work` `fetch_next_work(self, ts: MHATileSummary, mut state: MHATileState) -> WorkInfo` ### `advance` `advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: 
SIMD[uint32, 1]) -> OptionalReg[SeqInfo]` ### `grid_dim` `static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]` ### `initial_state` `initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState` ### `unsafe_seq_info` `unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo` --- ## time Implements the time package. ## Modules * [​`time`](/mojo/stdlib/time/time/): Implements basic utils for working with time. --- ## time Implements basic utils for working with time. You can import these APIs from the `time` package. For example: ```mojo from time import perf_counter_ns ``` ## Functions * [​`monotonic`](/mojo/stdlib/time/time/monotonic): Returns the current monotonic time in nanoseconds. This function queries the current platform's monotonic clock, making it useful for measuring time differences, but the significance of the returned value varies depending on the underlying implementation. * [​`perf_counter`](/mojo/stdlib/time/time/perf_counter): Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. * [​`perf_counter_ns`](/mojo/stdlib/time/time/perf_counter_ns): Return the value (in nanoseconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide. The reference point of the returned value is undefined, so that only the difference between the results of two calls is valid. * [​`sleep`](/mojo/stdlib/time/time/sleep): Suspends the current thread for the seconds specified. * [​`time_function`](/mojo/stdlib/time/time/time_function): Measures the time spent in the function. --- ## time_function `time_function[: origin.set, //, func: fn() raises capturing -> None]() -> UInt` Measures the time spent in the function. **Parameters:** * ​func (`fn() raises capturing -> None`): The function to time. **Returns:** The time elapsed in the function in ns. `time_function[: origin.set, //, func: fn() capturing -> None]() -> UInt` Measures the time spent in the function. **Parameters:** * ​func (`fn() capturing -> None`): The function to time. **Returns:** The time elapsed in the function in ns. --- ## tma_async Tensor Memory Accelerator (TMA) Asynchronous Operations Module. Provides high-performance abstractions for NVIDIA's Tensor Memory Accelerator (TMA), enabling efficient asynchronous data movement between global and shared memory in GPU kernels. It is designed for use with NVIDIA Hopper architecture and newer GPUs that support TMA instructions. ## Key Components: * `TMATensorTile`: Core struct that encapsulates a TMA descriptor for efficient data transfers between global and shared memory with various access patterns and optimizations. * `SharedMemBarrier`: Synchronization primitive for coordinating asynchronous TMA operations, ensuring data transfers complete before dependent operations begin. * `PipelineState`: Helper struct for managing multi-stage pipeline execution with circular buffer semantics, enabling efficient double or triple buffering techniques.
* `create_tma_tile`: Factory functions for creating optimized `TMATensorTile` instances with various configurations for different tensor shapes and memory access patterns. ## Structs * [​`PipelineState`](./PipelineState): Manages state for a multi-stage pipeline with circular buffer semantics. * [​`SharedMemBarrier`](./SharedMemBarrier): A hardware-accelerated synchronization primitive for GPU shared memory operations. * [​`TMATensorTile`](./TMATensorTile): A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement. * [​`TMATensorTileArray`](./TMATensorTileArray): An array of TMA descriptors. ## Functions * [​`create_tma_tile`](./create_tma_tile): Creates a `TMATensorTile` with specified tile dimensions and swizzle mode. --- ## tma_store_fence `tma_store_fence()` Establishes a memory fence for shared memory stores in TMA operations. This function creates a memory barrier that ensures all previous shared memory stores are completed before subsequent TMA (Tensor Memory Access) store operations begin. This is crucial for maintaining memory consistency in tensor operations. Note: This fence specifically targets the CTA (Cooperative Thread Array) scope and is used to synchronize async shared memory operations. --- ## tma_wgmma_warp_specialized_gemm_kernel `tma_wgmma_warp_specialized_gemm_kernel[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_tma_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, partitioned_multicast: Bool = False, use_tma_store: Bool = False, promotion_frequency: Int = 1, pdl_level: PDLLevel = PDLLevel(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1})](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin])` --- ## tma_wgmma_warp_specialized_gemm_kernel_persistent `tma_wgmma_warp_specialized_gemm_kernel_persistent[a_type: DType, b_type: DType, c_type: DType, a_layout: Layout, b_layout: Layout, a_tile_layout: Layout, b_tile_layout: Layout, c_layout: Layout, block_tile_shape: IndexList[3], wgmma_shape: IndexList[3], a_desc_layout: Layout, b_desc_layout: Layout, c_desc_layout: Layout, c_tma_layout: Layout, c_smem_layout: Layout, cluster_shape: StaticTuple[SIMD[int32, 1], 3], grid_shape: IndexList[2], a_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), b_swizzle: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](3)), c_swizzle:
TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0)), transpose_b: Bool = True, num_threads: Int = 128, pipeline_stages: Int = 7, partitioned_multicast: Bool = False, use_tma_store: Bool = False, promotion_frequency: Int = 1, pdl_level: PDLLevel = PDLLevel(), elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1})](a_tma_op: TMATensorTile[a_type, a_tile_layout, a_desc_layout], b_tma_op: TMATensorTile[b_type, b_tile_layout, b_desc_layout], c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin], problem_shape: IndexList[3])` --- ## TMATensorTile `struct TMATensorTile[dtype: DType, layout: Layout, desc_layout: Layout = layout]` A hardware-accelerated tensor memory access (TMA) tile for efficient asynchronous data movement. The TMATensorTile struct provides a high-performance interface for asynchronous data transfers between global memory and shared memory in GPU tensor operations. It encapsulates a TMA descriptor that defines the memory access pattern and provides methods for various asynchronous operations. Performance: * Hardware-accelerated memory transfers using TMA instructions * Supports prefetching of descriptors for latency hiding * Enforces 128-byte alignment requirements for optimal memory access ## Parameters * ​dtype (`DType`): The data type of the tensor elements. * ​layout (`Layout`): The layout of the tile in shared memory, typically specified as row\_major. * ​desc\_layout (`Layout`): The layout of the descriptor (defaults to `layout`), which can be different from the shared memory layout to accommodate hardware requirements like WGMMA. ## Fields * ​descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern. This field stores the hardware descriptor that encodes information about: * The source tensor's memory layout and dimensions * The tile shape and access pattern * Swizzling configuration for optimal memory access The descriptor is used by the GPU's Tensor Memory Accelerator hardware to efficiently transfer data between global and shared memory. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `@implicit` `__init__(out self, descriptor: TMADescriptor)` Initializes a new TMATensorTile with the provided TMA descriptor. **Args:** * ​descriptor (`TMADescriptor`): The TMA descriptor that defines the memory access pattern. ### `__copyinit__` `__copyinit__(out self, other: Self)` Copy initializes this `TMATensorTile` from another instance. **Args:** * ​other (`Self`): The other `TMATensorTile` instance to copy from. ### `prefetch_descriptor` `prefetch_descriptor(self)` Prefetches the TMA descriptor into cache to reduce latency. This method helps hide memory access latency by prefetching the descriptor before it's needed for actual data transfers.
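As a rough kernel-body excerpt showing how these methods fit together (a sketch only: `tma_tile`, the shared memory tile `smem_tile`, the barrier `mbar`, and `expected_bytes` are assumed to be set up elsewhere, and the `SharedMemBarrier` methods `expect_bytes()` and `wait()` are assumptions from that struct's API):

```mojo
# Hide descriptor-fetch latency before the transfer is issued.
tma_tile.prefetch_descriptor()
if thread_idx.x == 0:
    # Tell the barrier how many bytes the TMA transfer will deliver,
    # then schedule the asynchronous global -> shared copy of tile (0, 0).
    mbar[0].expect_bytes(expected_bytes)
    tma_tile.async_copy(smem_tile, mbar[0], (0, 0))
# Every thread blocks until the transfer tracked by the barrier lands.
mbar[0].wait()
```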
### `async_copy` `async_copy(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt])` Schedules an asynchronous copy from global memory to shared memory at specified coordinates. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory. The transfer is tracked by the provided memory barrier. **Constraints:** * The destination tensor must be 128-byte aligned in shared memory. * The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. * ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data. ### `async_copy_3d` `async_copy_3d(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt, UInt])` Schedules an asynchronous copy from global memory to shared memory at specified 3D coordinates. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to the specified destination in shared memory for 3D tensors. The transfer is tracked by the provided memory barrier. **Constraints:** * The destination tensor must be 128-byte aligned in shared memory. * The descriptor layout may be smaller than the shared memory tile shape to accommodate hardware requirements. **Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. * ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt, UInt]`): The 3D coordinates in the source tensor from which to copy data. ### `async_multicast_load` `async_multicast_load(self, dst: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ref [3] mem_barrier: SharedMemBarrier, coords: Tuple[UInt, UInt], multicast_mask: SIMD[uint16, 1])` Schedules an asynchronous multicast load from global memory to multiple shared memory locations. This method initiates a hardware-accelerated asynchronous transfer of data from global memory to multiple destination locations in shared memory across different CTAs (Cooperative Thread Arrays) as specified by the multicast mask. **Constraints:** The destination tensor must be 128-byte aligned in shared memory. 
**Args:** * ​dst (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The destination tensor in shared memory where data will be copied. Must be 128-byte aligned. * ​mem\_barrier (`SharedMemBarrier`): The memory barrier used to track and synchronize the asynchronous transfer. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the source tensor from which to copy data. * ​multicast\_mask (`SIMD[uint16, 1]`): A bit mask specifying which CTAs should receive the data. ### `async_store` `async_store(self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])` Schedules an asynchronous store from shared memory to global memory. This method initiates a hardware-accelerated asynchronous transfer of data from shared memory to global memory at the specified coordinates. **Constraints:** The source tensor must be 128-byte aligned in shared memory. **Args:** * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in shared memory from which data will be copied. Must be 128-byte aligned. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where data will be stored. ### `async_reduce` `async_reduce[reduction_kind: ReduceOp](self, src: LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], coords: Tuple[UInt, UInt])` Schedules an asynchronous reduction operation from shared memory to global memory. This method initiates a hardware-accelerated asynchronous reduction operation that combines data from shared memory with data in global memory using the specified reduction operation. The reduction is performed element-wise at the specified coordinates in the global tensor. **Constraints:** The source tensor must be 128-byte aligned in shared memory. **Parameters:** * ​reduction\_kind (`ReduceOp`): The type of reduction operation to perform (e.g., ADD, MIN, MAX). This determines how values are combined during the reduction. **Args:** * ​src (`LayoutTensor[dtype, layout, origin, address_space=AddressSpace(3), element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment]`): The source tensor in shared memory containing the data to be reduced. Must be 128-byte aligned. * ​coords (`Tuple[UInt, UInt]`): The 2D coordinates in the destination tensor where the reduction will be applied. ### `commit_group` `commit_group(self)` Commits all previously initiated but uncommitted TMA instructions into a group. This function behaves the same as `cp_async_bulk_commit_group`, which creates a synchronization point for bulk TMA transfer. ### `wait_group` `wait_group[n: Int = 0](self)` Waits for the completion of asynchronous copies until at most a specified number of copy groups are still pending.
This function behaves the same as `cp_async_bulk_wait_group`, which causes the executing thread to wait until at most a specified number of the most recent TMA copy groups are pending. **Parameters:** * ​n (`Int`): The number of pending groups left. ### `smem_tensormap_init` `smem_tensormap_init(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])` Initializes a TMA descriptor in shared memory from this tensor tile's descriptor. This method copies the TMA descriptor from global memory to shared memory, allowing for faster access during kernel execution. The descriptor is copied in 16-byte chunks using asynchronous copy operations for efficiency. Note: * Only one thread should call this method to avoid race conditions * The descriptor is copied in 8 chunks of 16 bytes each (total 128 bytes) **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the location in shared memory where the descriptor will be stored. Must be properly aligned. ### `replace_tensormap_global_address_in_gmem` `replace_tensormap_global_address_in_gmem[dtype: DType](self, src_ptr: UnsafePointer[SIMD[dtype, 1]])` Replaces the global memory address in the TMA descriptor stored in global memory. This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. The operation modifies the descriptor in global memory directly. Note: A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. **Args:** * ​src\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): The new source tensor whose address will replace the current one in the descriptor. Must have compatible layout with the original tensor. ### `tensormap_fence_acquire` `tensormap_fence_acquire(self)` Establishes a memory fence for TMA operations with acquire semantics. This method ensures proper ordering of memory operations by creating a barrier that prevents subsequent TMA operations from executing before prior operations have completed. It is particularly important when reading from a descriptor that might have been modified by other threads or processes. The acquire semantics ensure that all memory operations after this fence will observe any modifications made to the descriptor before the fence. Notes: * The entire warp must call this function as the instruction is warp-aligned. * Typically used in pairs with `tensormap_fence_release` for proper synchronization. ### `tensormap_fence_release` `tensormap_fence_release(self)` Establishes a memory fence for TMA operations with release semantics. This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is particularly important when modifying a TMA descriptor in global memory that might be read by other threads or processes. The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence. Notes: * Typically used after modifying a tensormap descriptor in global memory. * Often paired with `tensormap_fence_acquire` for proper synchronization.
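To illustrate the release/acquire pairing described in the notes above, here is a kernel-body sketch using only the methods documented in this section (`tma`, `new_src`, and the single-thread guard are illustrative):

```mojo
# Sketch only: `tma` is a TMATensorTile; `new_src` is an UnsafePointer
# to the replacement source tensor's data in global memory.
if thread_idx.x == 0:
    # Retarget the descriptor at a new source tensor...
    tma.replace_tensormap_global_address_in_gmem(new_src)
    # ...and publish the modification with release semantics.
    tma.tensormap_fence_release()
# The entire warp acquires before reading the updated descriptor.
tma.tensormap_fence_acquire()
```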
### `replace_tensormap_global_address_in_shared_mem` `replace_tensormap_global_address_in_shared_mem[dtype: DType](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], src_ptr: UnsafePointer[SIMD[dtype, 1]])` Replaces the global memory address in the TMA descriptor stored in shared memory. This method allows dynamically changing the source tensor for TMA operations without recreating the entire descriptor, which is useful for reusing descriptors with different data sources. The operation modifies a descriptor that has been previously copied to shared memory. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. * Typically used with descriptors previously initialized with `smem_tensormap_init`. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​src\_ptr (`UnsafePointer[SIMD[dtype, 1]]`): The new source tensor whose address will replace the current one in the descriptor. ### `tensormap_cp_fence_release` `tensormap_cp_fence_release(self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3)])` Establishes a memory fence for TMA operations with release semantics for shared memory descriptors. This method ensures proper ordering of memory operations by creating a barrier that ensures all prior memory operations are visible before subsequent operations can proceed. It is specifically designed for synchronizing between global memory and shared memory TMA descriptors. The release semantics ensure that all memory operations before this fence will be visible to any thread that observes operations after the fence. Notes: * The entire warp must call this function as the instruction is warp-aligned * Typically used after modifying a tensormap descriptor in shared memory * More specialized than the general `tensormap_fence_release` for cross-memory space synchronization **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3)]`): Pointer to the TMA descriptor in shared memory that is being synchronized with the global memory descriptor. ### `replace_tensormap_global_dim_strides_in_shared_mem` `replace_tensormap_global_dim_strides_in_shared_mem[dtype: DType, only_update_dim_0: Bool, /, *, rank: Int](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], gmem_dims: IndexList[rank], gmem_strides: IndexList[rank])` Replaces dimensions and strides in a TMA descriptor stored in shared memory. Note: This function is only supported for CUDA versions >= 12.5. This function allows dynamically modifying the dimensions and strides of a TMA descriptor that has been previously initialized in shared memory. If only the first dimension (dim 0) is updated, then updating strides can be skipped. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the new source tensor. 
* ​only\_update\_dim\_0 (`Bool`): If true, only the first dimension (dim 0) is updated, without updating strides. * ​rank (`Int`): The rank of the tensor. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​gmem\_dims (`IndexList[rank]`): The global dimensions of the tensor to be updated. * ​gmem\_strides (`IndexList[rank]`): The global strides of the tensor to be updated. `replace_tensormap_global_dim_strides_in_shared_mem[dtype: DType, tensor_rank: Int, dim_idx: Int](self, smem_tma_descriptor_ptr: UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin], dim_value: SIMD[uint32, 1], dim_stride: Optional[SIMD[uint64, 1]] = Optional(None))` Replaces dimensions and strides in a TMA descriptor stored in shared memory. Note: This function is only supported for CUDA versions >= 12.5. This function allows dynamically modifying the dimensions and strides of a TMA descriptor that has been previously initialized in shared memory. If only the first dimension is updated, then updating strides can be skipped. Notes: * Only one thread should call this method to avoid race conditions. * A memory fence may be required after this operation to ensure visibility of the changes to other threads. **Parameters:** * ​dtype (`DType`): The data type of the source tensor in GMEM. * ​tensor\_rank (`Int`): The rank of the source tensor in GMEM. * ​dim\_idx (`Int`): The index of the dimension to be updated in the TMA descriptor with the provided dimension and stride values at runtime. **Args:** * ​smem\_tma\_descriptor\_ptr (`UnsafePointer[TMADescriptor, address_space=AddressSpace(3), alignment=alignment, mut=mut, origin=origin]`): Pointer to the TMA descriptor in shared memory that will be modified. * ​dim\_value (`SIMD[uint32, 1]`): The new dimension value to be set. * ​dim\_stride (`Optional[SIMD[uint64, 1]]`): The new stride value to be set. --- ## TMATensorTileArray `@register_passable(trivial)` `struct TMATensorTileArray[num_of_tensormaps: Int, dtype: DType, cta_tile_layout: Layout, desc_layout: Layout]` An array of TMA descriptors. ## Parameters * ​num\_of\_tensormaps (`Int`): The number of TMA descriptors, aka tensor maps. * ​dtype (`DType`): The data type of the tensor elements. * ​cta\_tile\_layout (`Layout`): The layout of the tile in shared memory, typically specified as row\_major. * ​desc\_layout (`Layout`): The layout of the descriptor, which can be different from the shared memory layout to accommodate hardware requirements like WGMMA. ## Fields * ​tensormaps\_ptr (`UnsafePointer[SIMD[uint8, 1]]`): A static tuple of pointers to TMA descriptors. This field stores an array of pointers to `TMATensorTile` instances, where each pointer references a TMA descriptor in device memory. The array has a fixed size determined by the num\_of\_tensormaps parameter. The TMA descriptors are used by the GPU hardware to efficiently transfer data between global and shared memory with specific memory access patterns defined by the layouts. ## Implemented traits `AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility` ## Aliases ### `descriptor_bytes` `alias descriptor_bytes = 128` Size of the TMA descriptor in bytes. This is a constant value that represents the size of the TMA descriptor in bytes.
It is used to calculate the offset of the TMA descriptor in device memory. ## Methods ### `__init__` `__init__(out self, tensormaps_device: DeviceBuffer[uint8])` Initializes a new TMATensorTileArray. **Args:** * ​tensormaps\_device (`DeviceBuffer[uint8]`): Device buffer to store TMA descriptors. ### `__getitem__` `__getitem__(self, index: Int) -> UnsafePointer[TMATensorTile[dtype, cta_tile_layout, desc_layout]]` Retrieve a TMA descriptor. **Args:** * ​index (`Int`): Index of the TMA descriptor. **Returns:** `UnsafePointer` to the `TMATensorTile` at the specified index. --- ## to_nest `to_nest(nested: IntTuple[origin], flat: IntTuple[origin]) -> IntTuple` Nests a flat `IntTuple` according to the structure of a nested `IntTuple`. This function reshapes a flat sequence of values into a hierarchical structure that matches the pattern of a template nested `IntTuple`. Example: ```mojo from layout import IntTuple from layout.int_tuple import to_nest var result = to_nest(IntTuple(2, IntTuple(3, 4), 5), IntTuple(1, 2, 3, 4)) # returns IntTuple(1, (2, 3), 4) ``` **Args:** * ​nested (`IntTuple[origin]`): The template `IntTuple` defining the desired structure. * ​flat (`IntTuple[origin]`): The flat `IntTuple` containing the values to be nested. **Returns:** A new `IntTuple` with the values from flat arranged in the structure of nested. --- ## to_unknown `to_unknown(t: IntTuple[origin]) -> IntTuple` Create an `IntTuple` with the same structure but filled with `UNKNOWN_VALUE`. This function preserves the hierarchical structure of the input `IntTuple` but replaces all integer values with `UNKNOWN_VALUE`. **Args:** * ​t (`IntTuple[origin]`): The template `IntTuple` defining the structure. **Returns:** A new `IntTuple` with the same structure as t but with all values replaced by `UNKNOWN_VALUE`. --- ## Tokenization Tokenization is the process of dividing the input for an AI model into discrete units that have numerical IDs called tokens. Depending on what the input is (such as text, audio, or an image), the tokens might be based on different words or subwords in text, or different slices/blocks of pixels in images. For example, consider the sentence, "The cat sat on the mat." A word-level tokenization might split this sentence into the following words: "The," "cat," "sat," "on," "the," "mat." Then it replaces each word with a token (a number). The token "vocabulary"—the mapping of words to numbers—is predetermined and may vary from model to model. But tokenizers in large language models (LLMs) are much more sophisticated than that. Among other things, they also tokenize punctuation (or combinations of words and punctuation) and break words into subwords, which allows them to tokenize words they've never seen before. Because LLMs are trained on these tokens, they don't actually understand words and letters the way we do. They can only recognize and generate information based on the token vocabulary that they were trained upon. (Popular LLMs have a token vocabulary with over 100,000 tokens.) --- ## tokenizer Implementations of provided tokenizers. ## `IdentityPipelineTokenizer` {#max.pipelines.lib.tokenizer.IdentityPipelineTokenizer} > *class* max.pipelines.lib.tokenizer.IdentityPipelineTokenizer(\*args, \*\*kwargs) ### `decode()` {#max.pipelines.lib.tokenizer.IdentityPipelineTokenizer.decode} > *async* decode(context, encoded, \*\*kwargs) Decodes response tokens to text. **Parameters:** * **context** (`TokenGeneratorContext` ) – Current generation context.
* **encoded** (`TokenizerEncoded` ) – Encoded response tokens. **Returns:** Un-encoded response text. **Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str) ### `encode()` {#max.pipelines.lib.tokenizer.IdentityPipelineTokenizer.encode} > *async* encode(prompt, add\_special\_tokens=False) Encodes text prompts as tokens. **Parameters:** * **prompt** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – Un-encoded prompt text. * **add\_special\_tokens** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the prompt exceeds the configured maximum length. **Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str) ### `eos` {#max.pipelines.lib.tokenizer.IdentityPipelineTokenizer.eos} > *property* eos\*: [int](https://docs.python.org/3/library/functions.html#int)\* The end of sequence token for this tokenizer. ### `expects_content_wrapping` {#max.pipelines.lib.tokenizer.IdentityPipelineTokenizer.expects_content_wrapping} > *property* expects\_content\_wrapping\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* If true, this tokenizer expects messages to have a content property. Text messages are formatted as: ```json { "type": "text", "content": "text content" } ``` instead of the OpenAI spec: ```json { "type": "text", "text": "text content" } ``` NOTE: Multimodal messages omit the content property. Both `image_urls` and `image` content parts are converted to: ```json { "type": "image" } ``` Their content is provided as byte arrays through the top-level property on the request object, i.e., `PipelineTokenizerRequest.images`. ## `PreTrainedPipelineTokenizer` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer} > *class* max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer(delegate) **Parameters:** **delegate** (`Union` `[` `PreTrainedTokenizer` `,` `PreTrainedTokenizerFast` `]` ) ### `apply_chat_template()` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer.apply_chat_template} > apply\_chat\_template(messages) **Parameters:** **messages** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`TokenGeneratorRequestMessage`](core.md#max.pipelines.core.TokenGeneratorRequestMessage) `]` ) **Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str) ### `decode()` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer.decode} > *async* decode(context, encoded, \*\*kwargs) Decodes response tokens to text. **Parameters:** * **context** (`TokenGeneratorContext` ) – Current generation context. * **encoded** (`TokenizerEncoded` ) – Encoded response tokens. **Returns:** Un-encoded response text. **Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str) ### `encode()` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer.encode} > *async* encode(prompt, add\_special\_tokens=False) Encodes text prompts as tokens. **Parameters:** * **prompt** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) – Un-encoded prompt text. * **add\_special\_tokens** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Raises:** [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If the prompt exceeds the configured maximum length. 
**Return type:** [*ndarray*](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)

### `eos` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer.eos}

> *property* eos\*: [int](https://docs.python.org/3/library/functions.html#int)\*

The end of sequence token for this tokenizer.

### `expects_content_wrapping` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer.expects_content_wrapping}

> *property* expects\_content\_wrapping\*: [bool](https://docs.python.org/3/library/functions.html#bool)\*

If true, this tokenizer expects messages to have a content property.

Text messages are formatted as:

```json
{ "type": "text", "content": "text content" }
```

instead of the OpenAI spec:

```json
{ "type": "text", "text": "text content" }
```

NOTE: Multimodal messages omit the content property. Both `image_urls` and `image` content parts are converted to:

```json
{ "type": "image" }
```

Their content is provided as byte arrays through the top-level property on the request object, i.e., `PipelineTokenizerRequest.images`.

## `TextAndVisionTokenizer` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer}

> *class* max.pipelines.lib.tokenizer.TextAndVisionTokenizer(model\_path, \*, revision=None, max\_length=None, max\_new\_tokens=None, trust\_remote\_code=False, \*\*unused\_kwargs)

Encapsulates creation of TextAndVisionContext and specific token encode/decode logic.

**Parameters:**

* **model\_path** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) )
* **revision** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` )
* **max\_length** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )
* **max\_new\_tokens** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )
* **trust\_remote\_code** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )

### `apply_chat_template()` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.apply_chat_template}

> apply\_chat\_template(messages)

**Parameters:** **messages** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`TokenGeneratorRequestMessage`](core.md#max.pipelines.core.TokenGeneratorRequestMessage) `]` )

**Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str)

### `decode()` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.decode}

> *async* decode(context, encoded, \*\*kwargs)

Transform a provided encoded token array back into readable text.

**Parameters:**

* **context** ([`TextAndVisionContext`](core.md#max.pipelines.core.TextAndVisionContext) )
* **encoded** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str)

### `encode()` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.encode}

> *async* encode(prompt, add\_special\_tokens=True)

Transform the provided prompt into a token array.
**Parameters:** * **prompt** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` ) * **add\_special\_tokens** ([`bool`](https://docs.python.org/3/library/functions.html#bool) ) **Return type:** [*ndarray*](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) ### `eos` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.eos} > *property* eos\*: [int](https://docs.python.org/3/library/functions.html#int)\* The end of sequence token for this tokenizer. ### `expects_content_wrapping` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.expects_content_wrapping} > *property* expects\_content\_wrapping\*: [bool](https://docs.python.org/3/library/functions.html#bool)\* If true, this tokenizer expects messages to have a content property. Text messages are formatted as: ```json { "type": "text", "content": "text content" } ``` instead of the OpenAI spec: ```json { "type": "text", "text": "text content" } ``` NOTE: Multimodal messages omit the content property. Both `image_urls` and `image` content parts are converted to: ```json { "type": "image" } ``` Their content is provided as byte arrays through the top-level property on the request object, i.e., `PipelineTokenizerRequest.images`. ### `new_context()` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.new_context} > *async* new\_context(request) Create a new TextAndVisionContext object, leveraging necessary information like cache\_seq\_id and prompt from TokenGeneratorRequest. **Parameters:** **request** ([`TokenGeneratorRequest`](core.md#max.pipelines.core.TokenGeneratorRequest) ) **Return type:** [*TextAndVisionContext*](core.md#max.pipelines.core.TextAndVisionContext) ## `TextTokenizer` {#max.pipelines.lib.tokenizer.TextTokenizer} > *class* max.pipelines.lib.tokenizer.TextTokenizer(model\_path, \*, revision=None, max\_length=None, max\_new\_tokens=None, trust\_remote\_code=False, enable\_llama\_whitespace\_fix=False, \*\*unused\_kwargs) Encapsulates creation of TextContext and specific token encode/decode logic. 
**Parameters:**

* **model\_path** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) )
* **revision** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` `None` )
* **max\_length** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )
* **max\_new\_tokens** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )
* **trust\_remote\_code** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )
* **enable\_llama\_whitespace\_fix** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )

### `apply_chat_template()` {#max.pipelines.lib.tokenizer.TextTokenizer.apply_chat_template}

> apply\_chat\_template(messages, tools, chat\_template\_options=None)

**Parameters:**

* **messages** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`TokenGeneratorRequestMessage`](core.md#max.pipelines.core.TokenGeneratorRequestMessage) `]` )
* **tools** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` [`TokenGeneratorRequestTool`](core.md#max.pipelines.core.TokenGeneratorRequestTool) `]` `|` `None` )
* **chat\_template\_options** ([`dict`](https://docs.python.org/3/library/stdtypes.html#dict) `[` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `,` [`Any`](https://docs.python.org/3/library/typing.html#typing.Any) `]` `|` `None` )

**Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str)

### `decode()` {#max.pipelines.lib.tokenizer.TextTokenizer.decode}

> *async* decode(context, encoded, \*\*kwargs)

Transform a provided encoded token array back into readable text.

**Parameters:**

* **context** ([`TextContext`](core.md#max.pipelines.core.TextContext) )
* **encoded** ([`ndarray`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray) )

**Return type:** [str](https://docs.python.org/3/library/stdtypes.html#str)

### `encode()` {#max.pipelines.lib.tokenizer.TextTokenizer.encode}

> *async* encode(prompt, add\_special\_tokens=True)

Transform the provided prompt into a token array.

**Parameters:**

* **prompt** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Sequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence) `[` [`int`](https://docs.python.org/3/library/functions.html#int) `]` )
* **add\_special\_tokens** ([`bool`](https://docs.python.org/3/library/functions.html#bool) )

**Return type:** [*ndarray*](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)

### `eos` {#max.pipelines.lib.tokenizer.TextTokenizer.eos}

> *property* eos\*: [int](https://docs.python.org/3/library/functions.html#int)\*

The end of sequence token for this tokenizer.

### `expects_content_wrapping` {#max.pipelines.lib.tokenizer.TextTokenizer.expects_content_wrapping}

> *property* expects\_content\_wrapping\*: [bool](https://docs.python.org/3/library/functions.html#bool)\*

If true, this tokenizer expects messages to have a content property.

Text messages are formatted as:

```json
{ "type": "text", "content": "text content" }
```

instead of the OpenAI spec:

```json
{ "type": "text", "text": "text content" }
```

NOTE: Multimodal messages omit the content property. Both `image_urls` and `image` content parts are converted to:

```json
{ "type": "image" }
```

Their content is provided as byte arrays through the top-level property on the request object, i.e., `PipelineTokenizerRequest.images`.
### `new_context()` {#max.pipelines.lib.tokenizer.TextTokenizer.new_context}

> *async* new\_context(request)

Create a new TextContext object, leveraging necessary information like cache\_seq\_id and prompt from TokenGeneratorRequest.

**Parameters:** **request** ([`TokenGeneratorRequest`](core.md#max.pipelines.core.TokenGeneratorRequest) )

**Return type:** [*TextContext*](core.md#max.pipelines.core.TextContext)

## `max_tokens_to_generate()` {#max.pipelines.lib.tokenizer.max_tokens_to_generate}

> max.pipelines.lib.tokenizer.max\_tokens\_to\_generate(prompt\_size, max\_length, max\_new\_tokens=None)

Returns the max number of new tokens to generate.

**Parameters:**

* **prompt\_size** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **max\_length** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )
* **max\_new\_tokens** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` `None` )

**Return type:** [int](https://docs.python.org/3/library/functions.html#int) | None

## `run_with_default_executor()` {#max.pipelines.lib.tokenizer.run_with_default_executor}

> *async* max.pipelines.lib.tokenizer.run\_with\_default\_executor(fn, \*args)

---

## top_k

`top_k[rank: Int, type: DType, out_idx_type: DType, //, largest: Bool = True, target: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("cpu")](input: NDBuffer[type, rank, origin], k: Int, axis: Int, out_vals: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], sorted: Bool, ctx: DeviceContextPtr)`

Implementation of the Top K algorithm. Returns the top or bottom K elements and their indices along a specified axis.

**Parameters:**

* ​rank (`Int`): Rank of the input.
* ​type (`DType`): Data type of the input buffer.
* ​out\_idx\_type (`DType`): The data type of the output indices (default is DType.int64).
* ​largest (`Bool`): Whether to find the maximum (top k) or minimum value (bottom k).
* ​target (`StringSlice[StaticConstantOrigin]`): The target to run on.

**Args:**

* ​input (`NDBuffer[type, rank, origin]`): The input tensor.
* ​k (`Int`): The number of largest/smallest elements to return.
* ​axis (`Int`): The axis along which to operate.
* ​out\_vals (`NDBuffer[type, rank, origin]`): Output values.
* ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): Output indices.
* ​sorted (`Bool`): Indicates if the top/bottom K elements are in (stable) sorted order.
* ​ctx (`DeviceContextPtr`): The device call context.

---

## top_k_fused_sampling_cpu

`top_k_fused_sampling_cpu[type: DType, rank: Int, out_idx_type: DType](k: Int, input: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))`

Generalized implementation of the Top K algorithm with sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume.

**Parameters:**

* ​type (`DType`): Data type of the input buffer.
* ​rank (`Int`): Rank of the input.
* ​out\_idx\_type (`DType`): Data type of the output indices.

**Args:**

* ​k (`Int`): The number of largest values to consider for sampling.
* ​input (`NDBuffer[type, rank, origin]`): The input tensor (any shape).
* ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): The output indices, with shape `input_shape[:-1] + [1]`.
* ​temperature (`SIMD[type, 1]`): The temperature-based scaling.
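For intuition, here is a minimal NumPy sketch of the fused top-k sampling recipe this function describes: temperature scaling, restriction to the K largest logits, softmax, then sampling one index per row. This is an illustrative reference, not the MAX kernel; the function name and its internals below are hypothetical.

```python
import numpy as np

def topk_sample(logits: np.ndarray, k: int, temperature: float = 1.0) -> np.ndarray:
    """Hypothetical reference: sample one token id per row from the top-k logits."""
    rng = np.random.default_rng()
    scaled = logits / temperature                    # temperature scaling
    out = np.empty(scaled.shape[:-1], dtype=np.int64)
    for idx in np.ndindex(out.shape):                # one sample per row/subvolume
        row = scaled[idx]
        top = np.argpartition(row, -k)[-k:]          # indices of the K largest logits
        probs = np.exp(row[top] - row[top].max())    # numerically stable softmax
        probs /= probs.sum()
        out[idx] = top[rng.choice(k, p=probs)]       # sampled token id
    return out

print(topk_sample(np.array([[2.0, 1.0, 0.1, -1.0]]), k=2))  # e.g. [0]
```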
---

## top_k_shape

`top_k_shape[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], k: Int, axis: Int) -> IndexList[rank]`

---

## top_k_shape_impl

`top_k_shape_impl[type: DType, rank: Int, single_thread_blocking_override: Bool](input: NDBuffer[type, rank, origin], k: Int, axis: Int) -> IndexList[rank]`

Compute the output shape of a top/bottom k operation.

**Parameters:**

* ​type (`DType`): Data type of the input buffer.
* ​rank (`Int`): Rank of the input.
* ​single\_thread\_blocking\_override (`Bool`): Whether this function can block.

**Args:**

* ​input (`NDBuffer[type, rank, origin]`): The input tensor.
* ​k (`Int`): The number of top/bottom elements (K).
* ​axis (`Int`): The axis along which the operation is applied.

**Returns:**

The output shape.

---

## top_p_sampling

`top_p_sampling[type: DType, rank: Int, out_idx_type: DType, //, _test_sort: Bool = False](top_ps: NDBuffer[type, 1, origin], input_logits: NDBuffer[type, rank, origin], out_token_ids: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))`

Naive CPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the cumulative probability mass (Top-P). A reference sketch of this procedure follows the `topk` module listing below.

---

## top_p_sampling_gpu

`top_p_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //, _test_sort: Bool = False](ctx: DeviceContext, top_ps: NDBuffer[type, 1, origin], input_logits: NDBuffer[type, rank, origin], out_token_ids: NDBuffer[out_idx_type, rank, origin], temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))`

GPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the cumulative probability mass (Top-P).

---

## topk

## Aliases

### `SEED`

`alias SEED = 0`

## Structs

* [​`TopK_2`](./TopK_2):

## Functions

* [​`bottom_k_shape`](./bottom_k_shape):
* [​`top_k`](./top_k): Implementation of the Top K algorithm. Returns the top or bottom K elements and their indices along a specified axis.
* [​`top_k_fused_sampling_cpu`](./top_k_fused_sampling_cpu): Generalized implementation of the Top K algorithm with sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume.
* [​`top_k_shape`](./top_k_shape):
* [​`top_k_shape_impl`](./top_k_shape_impl): Compute the output shape of a top/bottom k operation.
* [​`topk_fused_sampling_gpu`](./topk_fused_sampling_gpu): Top K algorithm with fused sampling. Returns the sampled indices from the Top-K of the innermost dimension of the input tensor for each row/subvolume.
* [​`topk_gpu`](./topk_gpu): Generalized implementation of the Top K algorithm with/without sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume, or the top K values and indices across the tensor.
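As referenced above, here is a minimal NumPy sketch of the top-p procedure described by `top_p_sampling` and `top_p_sampling_gpu`: temperature scaling, softmax, a descending sort, then sampling from the smallest prefix whose probability mass reaches the `top_p` threshold. It is an illustrative reference under those stated assumptions, not the MAX implementation, and the function name is hypothetical.

```python
import numpy as np

def top_p_sample(logits: np.ndarray, top_p: float, temperature: float = 1.0) -> int:
    """Hypothetical reference: nucleus (top-p) sampling over one row of logits."""
    rng = np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                              # softmax
    order = np.argsort(-probs)                        # sort probabilities descending
    cdf = np.cumsum(probs[order])
    # Smallest prefix whose cumulative mass reaches top_p (clamped for float error).
    cutoff = min(int(np.searchsorted(cdf, top_p)) + 1, len(probs))
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()      # renormalize the nucleus
    return int(kept[rng.choice(cutoff, p=kept_probs)])

print(top_p_sample(np.array([3.0, 1.5, 0.2, 0.1]), top_p=0.9))  # e.g. 0
```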
---

## TopK_2

`@register_passable(trivial)`

`struct TopK_2[T: DType, largest: Bool = True]`

## Fields

* ​p (`Int`):
* ​u (`SIMD[T, 1]`):

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__() -> Self`

### `insert`

`insert(mut self, elem: SIMD[T, 1], elem_id: Int)`

---

## topk_fused_sampling_gpu

`topk_fused_sampling_gpu[type: DType, rank: Int, out_idx_type: DType, //](ctx: DeviceContext, K: Int, input: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))`

Top K algorithm with fused sampling. Returns the sampled indices from the Top-K of the innermost dimension of the input tensor for each row/subvolume.

---

## topk_gpu

`topk_gpu[type: DType, rank: Int, out_idx_type: DType, //, sampling: Bool = True, largest: Bool = True](ctx: DeviceContext, K: Int, input: NDBuffer[type, rank, origin], out_vals: NDBuffer[type, rank, origin], out_idxs: NDBuffer[out_idx_type, rank, origin], block_size: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), num_blocks_per_input: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}), temperature: SIMD[type, 1] = __init__[__mlir_type.!pop.int_literal](1))`

Generalized implementation of the Top K algorithm with/without sampling. Returns the sampled index from the innermost dimension of the input tensor for each row/subvolume, or the top K values and indices across the tensor.

**Parameters:**

* ​type (`DType`): The data type of the input tensor.
* ​rank (`Int`): The rank of the input tensor.
* ​out\_idx\_type (`DType`): The data type of the output indices (default is DType.index).
* ​sampling (`Bool`): Whether to return token samples from the top-K distribution (default is True).
* ​largest (`Bool`): Whether to find the maximum or minimum value.

**Args:**

* ​ctx (`DeviceContext`): The context for GPU execution.
* ​K (`Int`): The number of top elements to keep.
* ​input (`NDBuffer[type, rank, origin]`): Input tensor as a device NDBuffer.
* ​out\_vals (`NDBuffer[type, rank, origin]`): Output buffer on device for the K largest values.
* ​out\_idxs (`NDBuffer[out_idx_type, rank, origin]`): Output buffer on device for the indices of the K largest values, or sampled token indices. The last dimension is 1 if sampling is True, otherwise K.
* ​block\_size (`OptionalReg[Int]`): The number of threads per block (default is 256, from TRT and empirical testing).
* ​num\_blocks\_per\_input (`OptionalReg[Int]`): Number of blocks per input (default computed from input size and block size). This is the equivalent of "BLOCKS\_PER\_BEAM" in the TRT-LLM kernel, allowing for much larger batch sizes by packing several elements per thread in the first stage.
* ​temperature (`SIMD[type, 1]`): The temperature-based scaling.
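The `num_blocks_per_input` parameter reflects a two-stage reduction: each block computes a local top-K over its slice of the row, and a second stage reduces the concatenated local winners. The final top-K set matches a direct top-K over the whole row, which is what makes the split valid. Below is a minimal NumPy sketch of that equivalence (illustrative only; the helper names are hypothetical, not MAX APIs).

```python
import numpy as np

def topk_vals(vals: np.ndarray, k: int) -> np.ndarray:
    """K largest values, in ascending order."""
    return np.sort(vals)[-k:]

def two_stage_topk(vals: np.ndarray, k: int, num_blocks: int) -> np.ndarray:
    blocks = np.array_split(vals, num_blocks)            # stage 1: one slice per block
    winners = np.concatenate([topk_vals(b, min(k, b.size)) for b in blocks])
    return topk_vals(winners, k)                         # stage 2: reduce local winners

rng = np.random.default_rng(0)
row = rng.standard_normal(1000)
# The two-stage reduction matches a direct top-K over the full row.
assert np.allclose(two_stage_topk(row, k=8, num_blocks=4), topk_vals(row, k=8))
```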
---

## topk_wrapper

`topk_wrapper[T: DType, out_idx_type: DType, is_top_p: Bool, largest: Bool = True, _test_sort: Bool = False](K: Int, num_elements: Int, num_blocks_per_input: Int, in_buffer: UnsafePointer[SIMD[T, 1]], local_topk_vals: UnsafePointer[SIMD[T, 1]], local_topk_idxs: UnsafePointer[SIMD[out_idx_type, 1]], p_threshold: UnsafePointer[SIMD[T, 1]], skip_sort: UnsafePointer[SIMD[bool, 1]])`

Copy of `Kernels/mojo/nn/topk.mojo:_topk_stage1` with the addition of max\_vals and p\_threshold arguments to determine if sorting is needed for top-p/min-p sampling.

**Parameters:**

* ​T (`DType`): The data type of the elements.
* ​out\_idx\_type (`DType`): The data type of the output indices.
* ​is\_top\_p (`Bool`): Whether this is for top-p sampling (True) or min-p sampling (False).
* ​largest (`Bool`): Whether to find the maximum or minimum value.
* ​\_test\_sort (`Bool`): An internal test flag that prevents skipping the sort during testing.

**Args:**

* ​K (`Int`): Number of top elements to select per block.
* ​num\_elements (`Int`): Size of the last dimension of the input buffer (vocab size).
* ​num\_blocks\_per\_input (`Int`): Number of blocks used to process the input data.
* ​in\_buffer (`UnsafePointer[Scalar[T]]`): Input buffer containing the elements to process.
* ​local\_topk\_vals (`UnsafePointer[Scalar[T]]`): Output buffer to store the local top-K values.
* ​local\_topk\_idxs (`UnsafePointer[Scalar[out_idx_type]]`): Output buffer to store the indices of the local top-K elements.
* ​p\_threshold (`UnsafePointer[Scalar[T]]`): Threshold for top-p sampling if `is_top_p` is True, else the min-p coefficient.
* ​skip\_sort (`UnsafePointer[Scalar[DType.bool]]`): Output buffer to store whether sorting is needed.

---

## topp_minp_sampling_kernel

`topp_minp_sampling_kernel[type: DType, out_idx_type: DType, is_top_p: Bool](p_thresholds_: UnsafePointer[SIMD[type, 1]], sorted_probs_: UnsafePointer[SIMD[type, 1]], sorted_ids_: UnsafePointer[SIMD[out_idx_type, 1]], out_token_ids: UnsafePointer[SIMD[out_idx_type, 1]], skip_sort: UnsafePointer[SIMD[bool, 1]], vocab_size: Int)`

Top-P/Min-P sampling kernel.

**Parameters:**

* ​type (`DType`): The scalar value dtype.
* ​out\_idx\_type (`DType`): The output index type.
* ​is\_top\_p (`Bool`): Whether to use Top-P (True) or Min-P (False) sampling.

---

## toppminp

## Functions

* [​`merge`](./merge): Merge two sorted subarrays into one sorted array.
* [​`merge_sort_recursive`](./merge_sort_recursive): Recursive merge sort implementation.
* [​`min_p_sampling`](./min_p_sampling): Naive CPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the calculated probability threshold (Min-P).
* [​`sort_buf_descending`](./sort_buf_descending): Sort each batch separately in descending order using parallel merge sort.
* [​`top_p_sampling`](./top_p_sampling): Naive CPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a merge sort, and then samples tokens based on the cumulative probability mass (Top-P).

---

## toppminp_gpu

## Aliases

### `DEBUG_FILE`

`alias DEBUG_FILE = False`

### `SEED`

`alias SEED = 42`

## Functions

* [​`min_p_sampling_gpu`](./min_p_sampling_gpu): GPU implementation of Min-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the calculated probability threshold (Min-P).
* [​`normalize`](./normalize):
* [​`normalize_u32`](./normalize_u32):
* [​`radix_sort_pairs_kernel`](./radix_sort_pairs_kernel): Radix pair sort kernel for (default) descending order.
* [​`run_radix_sort_pairs_gpu`](./run_radix_sort_pairs_gpu):
* [​`top_p_sampling_gpu`](./top_p_sampling_gpu): GPU implementation of Top-P sampling for token selection. This function applies temperature scaling, softmax, a radix sort, and then samples tokens based on the cumulative probability mass (Top-P).
* [​`topk_wrapper`](./topk_wrapper): Copy of `Kernels/mojo/nn/topk.mojo:_topk_stage1` with the addition of max\_vals and p\_threshold arguments to determine if sorting is needed for top-p/min-p sampling.
* [​`topp_minp_sampling_kernel`](./topp_minp_sampling_kernel): Top-P/Min-P sampling kernel.

---

## torch

## `CustomOpLibrary` {#max.torch.CustomOpLibrary}

> *class* max.torch.CustomOpLibrary(kernel\_library)

A PyTorch interface to custom operations implemented in Mojo.

This API allows for easy passing of PyTorch data as `torch.Tensor` values to the corresponding custom op. `CustomOpLibrary` handles the compilation of the Mojo custom ops and marshalling of data between PyTorch and the executable Mojo code.

For example, consider a grayscale operation implemented in Mojo:

```mojo title="my_library/grayscale.mojo"
@register("grayscale")
struct Grayscale:
    @staticmethod
    fn execute[
        # The kind of device this is running on: "cpu" or "gpu"
        target: StaticString,
    ](
        img_out: OutputTensor[type = DType.uint8, rank=2],
        img_in: InputTensor[type = DType.uint8, rank=3],
        ctx: DeviceContextPtr,
    ) raises:
        ...
```

You can then use `CustomOpLibrary` to invoke the Mojo operation like so:

```python
import torch
from max.torch import CustomOpLibrary

op_library = CustomOpLibrary("my_library")
grayscale_op = op_library.grayscale

def grayscale(pic: torch.Tensor) -> torch.Tensor:
    result = pic.new_empty(pic.shape[:-1])
    grayscale_op(result, pic)
    return result

img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
result = grayscale(img)
```

The custom operation produced by `op_library.<op_name>` will have the same interface as the backing Mojo operation. Each `InputTensor` or `OutputTensor` argument corresponds to a [`torch.Tensor`](https://docs.pytorch.org/docs/stable/tensors.html#tensor-class-reference) value in Python. Each argument corresponding to an `OutputTensor` in the Mojo operation will be modified in-place.

**Parameters:**

**kernel\_library** (`Path` `|` [`KernelLibrary`](graph/KernelLibrary.md#max.graph.KernelLibrary) ) – The path to a `.mojo` file or a `.mojopkg` with your custom op kernels, or the corresponding library object.

---

## Trace

`struct Trace[level: TraceLevel, *, category: TraceCategory = TraceCategory(4), target: Optional[StringSlice[StaticConstantOrigin]] = Optional(None)]`

An object representing a specific trace. This struct provides functionality for creating and managing trace events for profiling and debugging purposes.

## Parameters

* ​level (`TraceLevel`): The trace level to use.
* ​category (`TraceCategory`): The trace category to use (defaults to TraceCategory.MAX).
* ​target (`Optional[StringSlice[StaticConstantOrigin]]`): Optional target information to include in the trace.

## Fields

* ​int\_payload (`OptionalReg[Int]`): Optional integer payload, typically used for task IDs that are appended to trace names.
* ​detail (`String`): Additional details about the trace event, included when detailed tracing is enabled.
* ​event\_id (`Int`): Unique identifier for the trace event, assigned when the trace begins.
* ​parent\_id (`Int`): Identifier of the parent trace event, used for creating hierarchical trace relationships.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, *, owned _name_value: Variant[String, StringSlice[StaticConstantOrigin]], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))`

Creates a Mojo trace with the given name.

**Args:**

* ​\_name\_value (`Variant[String, StringSlice[StaticConstantOrigin]]`): The name that is used to identify this Mojo trace.
* ​detail (`String`): Details of the trace entry.
* ​parent\_id (`Int`): Parent to associate the trace with. The trace name will be appended to the parent name. 0 (default) indicates no parent.
* ​task\_id (`OptionalReg[Int]`): Int that is appended to the name.

`__init__(out self, owned name: String, detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))`

Creates a Mojo trace with the given string name.

**Args:**

* ​name (`String`): The name that is used to identify this Mojo trace.
* ​detail (`String`): Details of the trace entry.
* ​parent\_id (`Int`): Parent to associate the trace with. The trace name will be appended to the parent name. 0 (default) indicates no parent.
* ​task\_id (`OptionalReg[Int]`): Int that is appended to the name.

`__init__(out self, name: StringSlice[StaticConstantOrigin], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))`

Creates a Mojo trace with the given static string name.

**Args:**

* ​name (`StringSlice[StaticConstantOrigin]`): The name that is used to identify this Mojo trace.
* ​detail (`String`): Details of the trace entry.
* ​parent\_id (`Int`): Parent to associate the trace with. The trace name will be appended to the parent name. 0 (default) indicates no parent.
* ​task\_id (`OptionalReg[Int]`): Int that is appended to the name.

`__init__(out self, name: StringLiteral[value], detail: String = __init__[__mlir_type.!kgen.string](""), parent_id: Int = 0, *, task_id: OptionalReg[Int] = OptionalReg[Int]({:i1 0, 1}))`

Creates a Mojo trace with the given string literal name.

**Args:**

* ​name (`StringLiteral[value]`): The name that is used to identify this Mojo trace.
* ​detail (`String`): Details of the trace entry.
* ​parent\_id (`Int`): Parent to associate the trace with. The trace name will be appended to the parent name. 0 (default) indicates no parent.
* ​task\_id (`OptionalReg[Int]`): Int that is appended to the name.

### `__enter__`

`__enter__(mut self)`

Enters the trace context. This begins recording of the trace event.

### `__exit__`

`__exit__(self)`

Exits the trace context. This finishes recording of the trace event.

### `mark`

`mark(self)`

Marks the tracer with the info at a specific point in time. This creates a point event in the trace timeline rather than a range.

### `name`

`name(self) -> String`

Returns the name of the trace.

**Returns:**

The name of the trace as a String.

### `start`

`start(mut self)`

Start recording the trace event. This begins recording of the trace event, similar to `__enter__()`.

### `end`

`end(mut self)`

End recording the trace event. This finishes recording of the trace event, similar to `__exit__()`.

---

## trace_arg

`trace_arg(name: String, shape: IndexList[size, element_type=element_type]) -> String`

Helper to stringify the type and shape of a kernel argument for tracing.
**Args:**

* ​name (`String`): The name of the argument.
* ​shape (`IndexList[size, element_type=element_type]`): The shape of the argument.

**Returns:**

A string representation of the argument with its shape.

`trace_arg(name: String, shape: IndexList[size, element_type=element_type], dtype: DType) -> String`

Helper to stringify the type and shape of a kernel argument for tracing.

**Args:**

* ​name (`String`): The name of the argument.
* ​shape (`IndexList[size, element_type=element_type]`): The shape of the argument.
* ​dtype (`DType`): The data type of the argument.

**Returns:**

A string representation of the argument with its shape and data type.

`trace_arg(name: String, buf: NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]) -> String`

Helper to stringify the type and shape of a kernel argument for tracing.

**Args:**

* ​name (`String`): The name of the argument.
* ​buf (`NDBuffer[type, rank, origin, shape, strides, alignment=alignment, address_space=address_space, exclusive=exclusive]`): The NDBuffer to trace.

**Returns:**

A string representation of the buffer with its shape and data type.

---

## trace_slice_arg

`trace_slice_arg(name: String, buf: ManagedTensorSlice[io_spec, static_spec=static_spec]) -> String`

Helper to stringify the type and shape of a kernel argument for tracing.

**Args:**

* ​name (`String`): The name of the argument.
* ​buf (`ManagedTensorSlice[io_spec, static_spec=static_spec]`): The tensor slice to trace.

**Returns:**

A string representation of the buffer with its shape and data type.

---

## TraceCategory

`@register_passable(trivial)`

`struct TraceCategory`

An enum-like struct specifying the type of tracing to perform.

## Fields

* ​value (`Int`): The integer value representing the trace category. Used for bitwise operations when determining if profiling is enabled for a specific category.

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Intable`, `Movable`, `UnknownDestructibility`

## Aliases

### `ASYNCRT`

`alias ASYNCRT = TraceCategory(1)`

### `Kernel`

`alias Kernel = TraceCategory(3)`

### `MAX`

`alias MAX = TraceCategory(4)`

### `MEM`

`alias MEM = TraceCategory(2)`

### `OTHER`

`alias OTHER = TraceCategory(0)`

## Methods

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

Compares for equality.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if they are equal.

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

Compares for inequality.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if they are not equal.

### `__is__`

`__is__(self, rhs: Self) -> Bool`

Compares for equality.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if they are equal.

### `__isnot__`

`__isnot__(self, rhs: Self) -> Bool`

Compares for inequality.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if they are not equal.

### `__int__`

`__int__(self) -> Int`

Converts the trace category to an integer.

**Returns:**

The integer value of the trace category.

---

## TraceLevel

`@register_passable(trivial)`

`struct TraceLevel`

An enum-like struct specifying the level of tracing to perform.

## Fields

* ​value (`Int`): The integer value representing the trace level.
Lower values indicate higher priority trace levels:

* 0 (ALWAYS): Always traced
* 1 (OP): Operation-level tracing
* 2 (THREAD): Thread-level tracing

## Implemented traits

`AnyType`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `ALWAYS`

`alias ALWAYS = TraceLevel(0)`

### `OP`

`alias OP = TraceLevel(1)`

### `THREAD`

`alias THREAD = TraceLevel(2)`

## Methods

### `__init__`

`@implicit`

`__init__(value: Int) -> Self`

Initializes a TraceLevel with the given integer value.

**Args:**

* ​value (`Int`): The integer value for the trace level.

### `__le__`

`__le__(self, rhs: Self) -> Bool`

Performs a less than or equal to comparison.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if this value is less than or equal to `rhs`.

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

Compares for equality.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if they are equal.

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

Compares for inequality.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if they are not equal.

### `__is__`

`__is__(self, rhs: Self) -> Bool`

Compares for equality.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if they are equal.

### `__isnot__`

`__isnot__(self, rhs: Self) -> Bool`

Compares for inequality.

**Args:**

* ​rhs (`Self`): The value to compare.

**Returns:**

True if they are not equal.

### `__int__`

`__int__(self) -> Int`

Converts the trace level to an integer.

**Returns:**

The integer value of the trace level.

---

## tracing

Provides tracing utilities.

## Structs

* [​`Trace`](/mojo/stdlib/runtime/tracing/Trace): An object representing a specific trace.
* [​`TraceCategory`](/mojo/stdlib/runtime/tracing/TraceCategory): An enum-like struct specifying the type of tracing to perform.
* [​`TraceLevel`](/mojo/stdlib/runtime/tracing/TraceLevel): An enum-like struct specifying the level of tracing to perform.

## Functions

* [​`get_current_trace_id`](/mojo/stdlib/runtime/tracing/get_current_trace_id): Returns the ID of the last created trace entry on the current thread.
* [​`is_profiling_disabled`](/mojo/stdlib/runtime/tracing/is_profiling_disabled): Returns False if profiling is enabled for the given category and level, and True otherwise.
* [​`is_profiling_enabled`](/mojo/stdlib/runtime/tracing/is_profiling_enabled): Returns True if profiling is enabled for the given category and level, and False otherwise.
* [​`trace_arg`](/mojo/stdlib/runtime/tracing/trace_arg): Helper to stringify the type and shape of a kernel argument for tracing.

---

## Traits

A *trait* is a set of requirements that a type must implement. You can think of it as a contract: a type that *conforms* to a trait guarantees that it implements all of the features of the trait.

Traits are similar to Java *interfaces*, C++ *concepts*, Swift *protocols*, and Rust *traits*. If you're familiar with any of those features, Mojo traits solve the same basic problem.

## Background

In dynamically-typed languages like Python, you don't need to explicitly declare that two classes are similar.
This is easiest to show by example:

```python
class Duck:
    def quack(self):
        print("Quack.")

class StealthCow:
    def quack(self):
        print("Moo!")

def make_it_quack(maybe_a_duck):
    try:
        maybe_a_duck.quack()
    except:
        print("Not a duck.")

make_it_quack(Duck())
make_it_quack(StealthCow())
```

The `Duck` and `StealthCow` classes aren't related in any way, but they both define a `quack()` method, so they work the same in the `make_it_quack()` function. This works because Python uses dynamic dispatch—it identifies the methods to call at runtime. So `make_it_quack()` doesn't care what types you're passing it, only the fact that they implement the `quack()` method.

In a statically-typed environment, this approach doesn't work: Mojo functions require you to specify the type of each argument. If you wanted to write this example in Mojo *without* traits, you'd need to write a function overload for each input type.

```mojo
@value
struct Duck:
    fn quack(self):
        print("Quack")

@value
struct StealthCow:
    fn quack(self):
        print("Moo!")

fn make_it_quack(definitely_a_duck: Duck):
    definitely_a_duck.quack()

fn make_it_quack(not_a_duck: StealthCow):
    not_a_duck.quack()

make_it_quack(Duck())
make_it_quack(StealthCow())
```

```output
Quack
Moo!
```

This isn't too bad with only two types. But the more types you want to support, the less practical this approach is.

You might notice that the Mojo versions of `make_it_quack()` don't include the `try/except` statement. We don't need it because Mojo's static type checking ensures that you can only pass instances of `Duck` or `StealthCow` into the `make_it_quack()` function.

## Using traits

Traits solve this problem by letting you define a shared set of *behaviors* that types can implement. Then you can write a function that depends on the trait, rather than individual types. As an example, let's update the `make_it_quack()` example using traits. The first step is defining a trait:

```mojo
trait Quackable:
    fn quack(self):
        ...
```

A trait looks a lot like a struct, except it's introduced by the `trait` keyword. A trait can contain method signatures, but it can't implement those methods. Each method signature must be followed by three dots (`...`) to indicate that the method is unimplemented.

A trait can also include associated aliases—compile-time constant values that must be defined by conforming structs. Associated aliases are useful for writing traits that describe generic types. For more information, see [Associated aliases for generics](#associated-aliases-for-generics).

:::note TODO

In the future, we plan to support defining fields and default method implementations inside a trait.

:::

Next we create some structs that conform to the `Quackable` trait. To indicate that a struct conforms to a trait, include the trait name in parentheses after the struct name. You can also include multiple traits, separated by commas. (If you're familiar with Python, this looks just like Python's inheritance syntax.)

```mojo
@value
struct Duck(Quackable):
    fn quack(self):
        print("Quack")

@value
struct StealthCow(Quackable):
    fn quack(self):
        print("Moo!")
```

The struct needs to implement any methods that are declared in the trait. The compiler enforces conformance: if a struct says it conforms to a trait, it must implement everything required by the trait or the code won't compile.
Finally, you can define a function that takes a `Quackable` like this:

```mojo
fn make_it_quack[type: Quackable](maybe_a_duck: type):
    maybe_a_duck.quack()
```

This syntax may look a little unfamiliar if you haven't dealt with Mojo [parameters](/mojo/manual/parameters/) before. What this signature means is that `maybe_a_duck` is an argument of type `type`, where `type` is a type that must conform to the `Quackable` trait.

Using the method is simple enough:

```mojo
make_it_quack(Duck())
make_it_quack(StealthCow())
```

```output
Quack
Moo!
```

Note that you don't need the square brackets when you call `make_it_quack()`: the compiler infers the type of the argument, and ensures the type has the required trait.

One limitation of traits is that you can't add traits to existing types. For example, if you define a new `Numeric` trait, you can't add it to the standard library `Float64` and `Int` types. However, the standard library already includes quite a few traits, and we'll be adding more over time.

### Traits can require static methods

In addition to regular instance methods, traits can specify required static methods.

```mojo
trait HasStaticMethod:
    @staticmethod
    fn do_stuff():
        ...

fn fun_with_traits[type: HasStaticMethod]():
    type.do_stuff()
```

## Trait compositions

You can compose traits using the `&` sigil. This lets you define new traits that are simple combinations of other traits. You can use a trait composition anywhere that you'd use a single trait:

```mojo
trait Flyable:
    fn fly(self):
        ...

fn quack_and_go[type: Quackable & Flyable](quacker: type):
    quacker.quack()
    quacker.fly()

@value
struct FlyingDuck(Quackable & Flyable):
    fn quack(self):
        print("quack")

    fn fly(self):
        print("whoosh!")

quack_and_go(FlyingDuck())
```

You can also use the `alias` keyword to create a shorthand name for a trait composition:

```mojo
alias DuckLike = Quackable & Flyable

struct ToyDuck(DuckLike):
    # ... implementation omitted
```

Previously, you could only compose traits using [inheritance](#trait-inheritance), by defining a new, empty trait like this:

```mojo
trait DuckTrait(Quackable, Flyable):
    pass
```

The difference is that using the `trait` keyword defines a new, named trait. For a struct to *explicitly* conform to this trait, you need to include it in the struct's signature. On the other hand, the `DuckLike` alias represents a composition of two separate traits, `Quackable` and `Flyable`, and anything that conforms to those two traits conforms to `DuckLike`. For example, our earlier `FlyingDuck` type:

```mojo
struct FlyingDuck(Quackable & Flyable):
    # ... etc
```

Because `FlyingDuck` conforms to both `Quackable` and `Flyable`, it also conforms to the `DuckLike` trait composition. But it *doesn't* explicitly conform to `DuckTrait`, since it doesn't include `DuckTrait` in its list of traits. Currently this distinction doesn't make much difference, because Mojo supports [implicit trait conformance](#implicit-trait-conformance), which means that `FlyingDuck` is treated as if it conforms to `DuckTrait`, since it meets all of the requirements. However, implicit conformance is due to be phased out in the future, so we recommend replacing empty traits like `DuckTrait` with more flexible trait compositions.

## Trait inheritance

Traits can inherit from other traits. A trait that inherits from another trait includes all of the requirements declared by the parent trait. For example:

```mojo
trait Animal:
    fn make_sound(self):
        ...

# Bird inherits from Animal
trait Bird(Animal):
    fn fly(self):
        ...
```

Since `Bird` inherits from `Animal`, a struct that conforms to the `Bird` trait needs to implement **both** `make_sound()` and `fly()`. And since every `Bird` conforms to `Animal`, a struct that conforms to `Bird` can be passed to any function that requires an `Animal`.

To inherit from multiple traits, add a comma-separated list of traits or trait compositions inside the parentheses. For example, you could define a `NamedAnimal` trait that combines the requirements of the `Animal` trait and a new `Named` trait:

```mojo
trait Named:
    fn get_name(self) -> String:
        ...

trait NamedAnimal(Animal & Named):
    # ...
```

Inheritance is useful when you're creating a new trait that adds its own requirements. If you simply want to express the union of two or more traits, you can use a simple trait composition instead:

```mojo
alias NamedAnimal = Animal & Named
```

## Traits and lifecycle methods

Traits can specify required [lifecycle methods](/mojo/manual/lifecycle/#lifecycles-and-lifetimes), including constructors, copy constructors and move constructors.

For example, the following code creates a `MassProducible` trait. A `MassProducible` type has a default (no-argument) constructor and can be moved. It uses the built-in [`Movable`](/mojo/stdlib/builtin/value/Movable) trait, which requires the type to have a [move constructor](/mojo/manual/lifecycle/life#move-constructor).

The `factory[]()` function returns a newly-constructed instance of a `MassProducible` type.

```mojo
trait DefaultConstructible:
    fn __init__(out self):
        ...

alias MassProducible = DefaultConstructible & Movable

fn factory[type: MassProducible]() -> type:
    return type()

struct Thing(MassProducible):
    var id: Int

    fn __init__(out self):
        self.id = 0

    fn __moveinit__(out self, owned existing: Self):
        self.id = existing.id

var thing = factory[Thing]()
```

Note that [`@register_passable("trivial")`](/mojo/manual/decorators/register-passable#register_passabletrivial) types have restrictions on their lifecycle methods: they can't define copy or move constructors, because they don't require any custom logic. For the purpose of trait conformance, the compiler treats trivial types as copyable and movable.

## Implicit trait conformance

Mojo currently supports *implicit* trait conformance, but this will be deprecated in a future release. Implicit conformance means that if a type implements all of the methods required for a trait, it's treated as conforming to the trait, even if it doesn't explicitly include the trait in its declaration:

```mojo
struct RubberDucky:
    fn quack(self):
        print("Squeak!")

make_it_quack(RubberDucky())
```

Implicit conformance can be convenient, but supporting it prevents us from adding future trait features like default function implementations. We strongly recommend using explicit trait conformance for all new code and phasing out dependence on implicit trait conformance.

## Built-in traits

The Mojo standard library includes many traits. They're implemented by a number of standard library types, and you can also implement these on your own types.
These standard library traits include:

* [`Absable`](/mojo/stdlib/builtin/math/Absable)
* [`AnyType`](/mojo/stdlib/builtin/anytype/AnyType)
* [`Boolable`](/mojo/stdlib/builtin/bool/Boolable)
* [`Comparable`](/mojo/stdlib/builtin/comparable/Comparable)
* [`Copyable`](/mojo/stdlib/builtin/value/Copyable)
* [`Defaultable`](/mojo/stdlib/builtin/value/Defaultable)
* [`Hashable`](/mojo/stdlib/hashlib/hash/Hashable)
* [`Indexer`](/mojo/stdlib/builtin/int/Indexer)
* [`Intable`](/mojo/stdlib/builtin/int/Intable)
* [`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising)
* [`KeyElement`](/mojo/stdlib/collections/dict/KeyElement)
* [`Movable`](/mojo/stdlib/builtin/value/Movable)
* [`PathLike`](/mojo/stdlib/os/pathlike/PathLike)
* [`Powable`](/mojo/stdlib/builtin/math/Powable)
* [`Representable`](/mojo/stdlib/builtin/repr/Representable)
* [`Sized`](/mojo/stdlib/builtin/len/Sized)
* [`Stringable`](/mojo/stdlib/builtin/str/Stringable)
* [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising)
* [`Roundable`](/mojo/stdlib/builtin/math/Roundable)
* [`Writable`](/mojo/stdlib/utils/write/Writable)
* [`Writer`](/mojo/stdlib/utils/write/Writer)

The API reference docs linked above include usage examples for each trait. The following sections discuss a few of these traits.

### The `Sized` trait

The [`Sized`](/mojo/stdlib/builtin/len/Sized) trait identifies types that have a measurable length, like strings and arrays.

Specifically, `Sized` requires a type to implement the `__len__()` method. This trait is used by the built-in [`len()`](/mojo/stdlib/builtin/len/len) function. For example, if you're writing a custom list type, you could implement this trait so your type works with `len()`:

```mojo
struct MyList(Sized):
    var size: Int
    # ...

    fn __init__(out self):
        self.size = 0

    fn __len__(self) -> Int:
        return self.size

print(len(MyList()))
```

```output
0
```

### The `Intable` and `IntableRaising` traits

The [`Intable`](/mojo/stdlib/builtin/int/Intable) trait identifies a type that can be implicitly converted to `Int`. The [`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising) trait describes a type that can be converted to an `Int`, but the conversion might raise an error.

Both of these traits require the type to implement the `__int__()` method. For example:

```mojo
@value
struct Foo(Intable):
    var i: Int

    fn __int__(self) -> Int:
        return self.i

var foo = Foo(42)
print(Int(foo) == 42)
```

```output
True
```

### The `Stringable`, `Representable`, and `Writable` traits

The [`Stringable`](/mojo/stdlib/builtin/str/Stringable) trait identifies a type that can be explicitly converted to [`String`](/mojo/stdlib/collections/string/string/String). The [`StringableRaising`](/mojo/stdlib/builtin/str/StringableRaising) trait describes a type that can be converted to a `String`, but the conversion might raise an error. These traits also mean that the type can support both the `{!s}` and `{}` format specifiers of the `String` and `StringSlice` classes' [`format()`](/mojo/stdlib/collections/string/string/String#format) method. These traits require the type to define the [`__str__()`](/mojo/stdlib/builtin/str/Stringable#__str__) method.

In contrast, the [`Representable`](/mojo/stdlib/builtin/repr/Representable) trait defines a type that can be used with the built-in [`repr()`](/mojo/stdlib/builtin/repr/repr) function, as well as the `{!r}` format specifier of the `format()` method.
This trait requires the type to define the [`__repr__()`](/mojo/stdlib/builtin/repr/Representable#__repr__) method, which should compute the "official" string representation of a type. If at all possible, this should look like a valid Mojo expression that could be used to recreate a struct instance with the same value.

The [`Writable`](/mojo/stdlib/utils/write/Writable) trait describes a type that can be converted to a stream of UTF-8 encoded data by writing to a `Writer` object. The [`print()`](/mojo/stdlib/builtin/io/print) function requires that its arguments conform to the `Writable` trait. This enables efficient stream-based writing by default, avoiding unnecessary intermediate String heap allocations.

The `Writable` trait requires a type to implement a [`write_to()`](/mojo/stdlib/utils/write/Writable#write_to) method, which is provided with an object that conforms to the [`Writer`](/mojo/stdlib/utils/write/Writer) trait as an argument. You then invoke the `Writer` instance's [`write()`](/mojo/stdlib/utils/write/Writer#write) method to write a sequence of `Writable` arguments constituting the `String` representation of your type.

While this might sound complex at first, in practice you can minimize boilerplate and duplicated code by using the [`String.write()`](/mojo/stdlib/collections/string/string/String#write) static function to implement the type's `Stringable` conformance in terms of its `Writable` implementation. Here is a simple example of a type that implements all of the `Stringable`, `Representable`, and `Writable` traits:

```mojo
@value
struct Dog(Stringable, Representable, Writable):
    var name: String
    var age: Int

    # Allows the type to be written into any `Writer`
    fn write_to[W: Writer](self, mut writer: W) -> None:
        writer.write("Dog(", self.name, ", ", self.age, ")")

    # Construct and return a `String` using the previous method
    fn __str__(self) -> String:
        return String.write(self)

    # Alternative full representation when calling `repr`
    fn __repr__(self) -> String:
        return String("Dog(name=", repr(self.name), ", age=", repr(self.age), ")")

var dog = Dog("Rex", 5)
print(repr(dog))
print(dog)

var dog_info = StaticString("String: {!s}\nRepresentation: {!r}").format(dog, dog)
print(dog_info)
```

```output
Dog(name='Rex', age=5)
Dog(Rex, 5)
String: Dog(Rex, 5)
Representation: Dog(name='Rex', age=5)
```

### The `AnyType` trait

When building a generic container type, one challenge is knowing how to dispose of the contained items when the container is destroyed. Any type that dynamically allocates memory needs to supply a [destructor](/mojo/manual/lifecycle/death#destructor) (`__del__()` method) that must be called to free the allocated memory. But not all types have a destructor, and your Mojo code has no way to determine which is which.

The [`AnyType`](/mojo/stdlib/builtin/anytype/AnyType) trait solves this issue: every trait implicitly inherits from `AnyType`, and all structs conform to `AnyType`, which guarantees that the type has a destructor. For types that don't have one, Mojo adds a no-op destructor. This means you can call the destructor on any type.

This makes it possible to build generic collections without leaking memory. When the collection's destructor is called, it can safely call the destructors on every item it contains.

## Generic structs with traits

You can also use traits when defining a generic container. A generic container is a container (for example, an array or hashmap) that can hold different data types.
In a dynamic language like Python it's easy to add different types of items to a container. But in a statically-typed environment the compiler needs to be able to identify the types at compile time. For example, if the container needs to copy a value, the compiler needs to verify that the type can be copied.

The [`List`](/mojo/stdlib/collections/list) type is an example of a generic container. A single `List` can only hold a single type of data. For example, you can create a list of integer values like this:

```mojo
from collections import List

var list = List[Int](1, 2, 3)
for i in range(len(list)):
    print(list[i], end=" ")
```

```output
1 2 3
```

You can use traits to define requirements for elements that are stored in a container. For example, `List` requires elements that can be moved and copied. To store a struct in a `List`, the struct needs to conform to the `Copyable` and `Movable` traits, which require a [copy constructor](/mojo/manual/lifecycle/life#copy-constructor) and a [move constructor](/mojo/manual/lifecycle/life#move-constructor).

Building generic containers is an advanced topic. For an introduction, see the section on [parameterized structs](/mojo/manual/parameters/#parameterized-structs).

### Associated aliases for generics

In addition to methods, a trait can include _associated aliases_, which must be defined by any conforming struct. For example:

```mojo
trait Repeater:
    alias count: Int
```

An implementing struct must define a concrete constant value for the alias, using any compile-time parameter value. For example, it can use a literal constant or a compile-time expression, including one that uses the struct's parameters.

```mojo
struct Doublespeak(Repeater):
    alias count: Int = 2

struct Multispeak[verbosity: Int](Repeater):
    alias count: Int = verbosity*2+1
```

The `Doublespeak` struct has a constant value for the alias, but the `Multispeak` struct lets the user set the value using a parameter:

```mojo
repeater = Multispeak[12]()
```

Note that the alias is named `count`, and the `Multispeak` parameter is named `verbosity`. Parameters and aliases are in the same namespace, so the parameter can't have the same name as the associated alias.

Associated aliases are most useful for writing traits for generic types. For example, imagine that you want to write a trait that describes a generic stack data structure that stores elements that conform to the `Copyable` and `Movable` traits. By adding the element type as an associated alias to the trait, you can specify generic methods on the trait:

```mojo
trait Stacklike:
    alias EltType: Copyable & Movable

    fn push(mut self, owned item: Self.EltType):
        ...

    fn pop(mut self) -> Self.EltType:
        ...
```

The following struct implements the `Stacklike` trait using a `List` as the underlying storage:

```mojo
struct MyStack[type: Copyable & Movable](Stacklike):
    """A simple Stack built using a List."""

    alias EltType = type
    alias list_type = List[Self.EltType]

    var list: Self.list_type

    fn __init__(out self):
        self.list = Self.list_type()

    fn push(mut self, owned item: Self.EltType):
        self.list.append(item)

    fn pop(mut self) -> Self.EltType:
        return self.list.pop()

    fn dump[
        WritableEltType: Writable & Copyable & Movable
    ](self: MyStack[WritableEltType]):
        for item in self.list:
            print(item[])
```

The `MyStack` type adds a `dump()` method that prints the contents of the stack.
Because a struct that conforms to `Copyable` and `Movable` is not necessarily printable, `MyStack` uses [conditional conformance](/mojo/manual/parameters/#conditional-conformance) to define a `dump()` method that works as long as the element type is [writable](/mojo/stdlib/utils/write/Writable/).

The following code exercises this new trait by defining a generic method, `add_to_stack()`, that adds an item to any `Stacklike` type.

```mojo
def add_to_stack[S: Stacklike](mut stack: S, item: S.EltType):
    stack.push(item)

def main():
    s = MyStack[Int]()
    add_to_stack(s, 12)
    add_to_stack(s, 33)
    s.dump()             # [12, 33]
    print(s.pop())       # 33
```

---

## transformer

## Modules

* [`distributed_transformer`](/max/api/python/nn/transformer/distributed_transformer)
* [`transformer`](/max/api/python/nn/transformer/transformer)

---

## transformer

## `ReturnLogits` {#max.nn.transformer.transformer.ReturnLogits}

> *class* max.nn.transformer.transformer.ReturnLogits(value, names=&lt;not given&gt;, \*values, module=None, qualname=None, type=None, start=1, boundary=None)

### `ALL` {#max.nn.transformer.transformer.ReturnLogits.ALL}

> ALL *= 'all'*

### `LAST_TOKEN` {#max.nn.transformer.transformer.ReturnLogits.LAST_TOKEN}

> LAST\_TOKEN *= 'last\_token'*

### `VARIABLE` {#max.nn.transformer.transformer.ReturnLogits.VARIABLE}

> VARIABLE *= 'variable'*

## `Transformer` {#max.nn.transformer.transformer.Transformer}

> *class* max.nn.transformer.transformer.Transformer(dim, n\_heads, layers, norm, output, embedding, kv\_params, kv\_collection\_constructor, return\_logits=ReturnLogits.LAST\_TOKEN, embedding\_multiplier=1.0, logits\_postprocessor=None)

Transformer model consisting of TransformerBlock layers.

**Parameters:**

* **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **n\_heads** ([`int`](https://docs.python.org/3/library/functions.html#int) )
* **layers** ([`list`](https://docs.python.org/3/library/stdtypes.html#list) `[` `Block` `]` )
* **norm** ([`Layer`](../layer.md#max.nn.layer.Layer) )
* **output** ([`LinearV1`](../linear.md#max.nn.linear.LinearV1) `|` [`Linear`](../linear.md#max.nn.linear.Linear) )
* **embedding** ([`EmbeddingV1`](../embedding.md#max.nn.embedding.EmbeddingV1) `|` [`Embedding`](../embedding.md#max.nn.embedding.Embedding) )
* **kv\_params** ([`KVCacheParams`](../kv_cache/cache_params.md#max.nn.kv_cache.cache_params.KVCacheParams) )
* **kv\_collection\_constructor** ([`FetchContinuousBatchingKVCacheCollection`](../kv_cache/continuous_batching_cache.md#max.nn.kv_cache.continuous_batching_cache.FetchContinuousBatchingKVCacheCollection) `|` `FetchPagedKVCacheCollection` )
* **return\_logits** ([`ReturnLogits`](#max.nn.transformer.transformer.ReturnLogits) )
* **embedding\_multiplier** ([`float`](https://docs.python.org/3/library/functions.html#float) )
* **logits\_postprocessor** (`Callable` `[` `[` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `]` `,` [`TensorValue`](../../graph/TensorValue.md#max.graph.TensorValue) `]` `|` `None` )

## `TransformerBlock` {#max.nn.transformer.transformer.TransformerBlock}

> *class* max.nn.transformer.transformer.TransformerBlock(attention, mlp, attention\_norm, mlp\_norm, residual\_multiplier=1.0)

Stack of Attention, FeedForward, and RMSNorm layers.
**Parameters:**

* **attention** ([`AttentionImpl`](../attention/interfaces.md#max.nn.attention.interfaces.AttentionImpl) `|` [`AttentionImplQKV`](../attention/interfaces.md#max.nn.attention.interfaces.AttentionImplQKV) `|` [`Module`](../layer.md#max.nn.layer.Module) )
* **mlp** ([`Layer`](../layer.md#max.nn.layer.Layer) )
* **attention\_norm** ([`Layer`](../layer.md#max.nn.layer.Layer) )
* **mlp\_norm** ([`Layer`](../layer.md#max.nn.layer.Layer) )
* **residual\_multiplier** ([`float`](https://docs.python.org/3/library/functions.html#float) )

---

## Transformer

A transformer is a neural network architecture designed to perform complex tasks with sequential data (such as text, speech, and images) in a manner that can be efficiently parallelized on GPUs or other accelerator hardware. This makes transformers highly effective for natural language processing and other generative AI (GenAI) applications.

The transformer model architecture was first introduced in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017). This design relies on [self-attention](self-attention.mdx) mechanisms instead of recurrent structures such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which allows processing to be parallelized across separate compute cores rather than requiring the model to generate predictions sequentially. This design is currently the foundation for all major large language models (LLMs) such as GPT, Llama, Gemini, DeepSeek, and more.

---

## TransientScheduler

`@register_passable(trivial)`

`struct TransientScheduler[tile_shape: SIMD[uint32, 1], num_heads: SIMD[uint32, 1]]`

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `MHATileScheduler`, `Movable`, `UnknownDestructibility`

## Aliases

### `may_advance`

`alias may_advance = False`

### `mha_schedule`

`alias mha_schedule = MHASchedule(__init__[__mlir_type.!pop.int_literal](0))`

## Methods

### `__init__`

`__init__() -> Self`

### `get_current_work_info`

`get_current_work_info(self) -> WorkInfo`

`get_current_work_info(self, ts: MHATileSummary, state: MHATileState) -> WorkInfo`

### `advance`

`advance[ragged: Bool, producer: Bool, sync: MHASchedulerSynchronization = MHASchedulerSynchronization(__init__[__mlir_type.!pop.int_literal](1))](self, ts: MHATileSummary, mut state: MHATileState, pipeline_idx: SIMD[uint32, 1]) -> OptionalReg[SeqInfo]`

### `grid_dim`

`static grid_dim(batch_size: SIMD[uint32, 1], max_num_prompt_tiles: SIMD[uint32, 1]) -> Tuple[Int, Int, Int]`

### `initial_state`

`initial_state(self, ptr: UnsafePointer[SIMD[uint32, 1], address_space=AddressSpace(3)], tile_summary: MHATileSummary) -> MHATileState`

### `unsafe_seq_info`

`unsafe_seq_info[ragged: Bool](self, ts: MHATileSummary, state: MHATileState) -> SeqInfo`

---

## transitional

Utilities for the transitional period during NDBuffer deprecation.

## Functions

* [​`managed_tensor_slice_to_ndbuffer`](/max/api/mojo/tensor/transitional/managed_tensor_slice_to_ndbuffer):

---

## transpose

The module implements transpose functions.

## Functions

* [​`transpose`](./transpose): Permute the axes of `input` based on `perms`, and place the result in `output`.
* [​`transpose_2d`](./transpose_2d):
* [​`transpose_3d_swap_inner`](./transpose_3d_swap_inner):
* [​`transpose_3d_swap_outer`](./transpose_3d_swap_outer):
* [​`transpose_4d_swap_middle`](./transpose_4d_swap_middle):
* [​`transpose_inplace`](./transpose_inplace):
* [​`transpose_strided`](./transpose_strided):
* [​`transpose_trivial_memcpy`](./transpose_trivial_memcpy):

---

## transpose

`transpose[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]])`

Permute the axes of `input` based on `perms`, and place the result in `output`.

Example:

```mojo
transpose(output, input, [2, 0, 1])  # guarantees output[x, y, z] = input[z, x, y]
```

**Parameters:**

* ​rank (`Int`): The rank of input and output buffers.
* ​type (`DType`): The dtype of buffer elements.

**Args:**

* ​output (`NDBuffer[type, rank, origin, shape]`): The output buffer.
* ​input (`NDBuffer[type, rank, origin, shape]`): The input buffer.
* ​perms (`UnsafePointer[SIMD[index, 1]]`): Permutation of the input axes.

---

## transpose_2d

`transpose_2d[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int, offset: Int)`

---

## transpose_3d_swap_inner

`transpose_3d_swap_inner[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)`

---

## transpose_3d_swap_outer

`transpose_3d_swap_outer[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)`

---

## transpose_4d_swap_middle

`transpose_4d_swap_middle[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape, strides], perms: UnsafePointer[SIMD[index, 1]], simplified_input_shape: IndexList[rank], simplified_rank: Int)`

---

## transpose_inplace

`transpose_inplace[rows: Int, cols: Int, type: DType](buf: NDBuffer[type, 2, origin, __init__[::Indexer,::Indexer](rows, cols)])`

---

## transpose_strided

`transpose_strided[rank: Int, type: DType, //](output: NDBuffer[type, rank, origin, shape], input: NDBuffer[type, rank, origin, shape], perms: UnsafePointer[SIMD[index, 1]])`

---

## transpose_trivial_memcpy

`transpose_trivial_memcpy[rank: Int, output_shape: DimList, input_shape: DimList, type: DType](output: NDBuffer[type, rank, origin, output_shape], input: NDBuffer[type, rank, origin, input_shape])`

---

## transpose_z_to_x_or_y

`transpose_z_to_x_or_y[destination: StringSlice[StaticConstantOrigin], type: DType](z_col_index: Int, xy_row_index: Int, z_row_suboffset: Int)`

---

## trunc

`trunc[T: Truncable, //](value: T) -> T`

Get the truncated value of the given object.

**Parameters:**

* ​T (`Truncable`): The type conforming to Truncable.

**Args:**

* ​value (`T`): The object to get the truncated value of.

**Returns:**

The truncated value of the object.

---

## Truncable

The `Truncable` trait describes a type that defines a truncation operation.

Types that conform to `Truncable` will work with the builtin `trunc` function.
The truncation operation always returns the same type as the input.

For example:

```mojo
from math import Truncable, trunc

@value
struct Complex(Truncable):
    var re: Float64
    var im: Float64

    fn __trunc__(self) -> Self:
        return Self(trunc(self.re), trunc(self.im))
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__trunc__`

`__trunc__(self: _Self) -> _Self`

Return the truncated value.

**Returns:**

The truncated value.

---

## tuple

Implements the Tuple type.

These are Mojo built-ins, so you don't need to import them.

## Structs

* [​`Tuple`](/mojo/stdlib/builtin/tuple/Tuple): The type of a literal tuple expression.

---

## Tuple

`struct Tuple[*element_types: Copyable & Movable]`

The type of a literal tuple expression.

A tuple consists of zero or more values, separated by commas.

## Parameters

* ​\*element\_types (`Copyable & Movable`): The element types.

## Fields

* ​storage (`!kgen.pack> element_types>`): The underlying storage for the tuple.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self: Tuple[])`

Construct an empty tuple.

`__init__(out self, owned *args: *element_types)`

Construct the tuple.

**Args:**

* ​\*args (`*element_types`): Initial values.

`__init__(out self, *, owned storage: VariadicPack[is_owned, origin, Copyable & Movable, element_types])`

Construct the tuple from a low-level internal representation.

**Args:**

* ​storage (`VariadicPack[is_owned, origin, Copyable & Movable, element_types]`): The variadic pack storage to construct from.

### `__copyinit__`

`__copyinit__(out self, existing: Self)`

Copy construct the tuple.

**Args:**

* ​existing (`Self`): The value to copy from.

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Move construct the tuple.

**Args:**

* ​existing (`Self`): The value to move from.

### `__del__`

`__del__(owned self)`

Destructor that destroys all of the elements.

### `__getitem__`

`__getitem__[idx: Int](ref self) -> ref [self] element_types[idx.value]`

Get a reference to an element in the tuple.

**Parameters:**

* ​idx (`Int`): The element to return.

**Returns:**

A reference to the specified element.

### `__contains__`

`__contains__[T: EqualityComparable & Copyable & Movable](self, value: T) -> Bool`

Return whether the tuple contains the specified value.

For example:

```mojo
var t = Tuple(True, 1, 2.5)
if 1 in t:
    print("t contains 1")
```

**Parameters:**

* ​T (`EqualityComparable & Copyable & Movable`): The type of the value.

**Args:**

* ​value (`T`): The value to search for.

**Returns:**

True if the value is in the tuple, False otherwise.

### `copy`

`copy(self) -> Self`

Explicitly construct a copy of self.

**Returns:**

A copy of this value.

### `__len__`

`static __len__() -> Int`

Return the number of elements in the tuple.

**Returns:**

The tuple length.

`__len__(self) -> Int`

Get the number of elements in the tuple.

**Returns:**

The tuple length.

---

## tuple_max

`tuple_max(t: IntTuple[origin]) -> Int`

Calculate the maximum value in an `IntTuple`.

This function recursively finds the maximum integer value in a potentially nested `IntTuple` structure.

**Args:**

* ​t (`IntTuple[origin]`): The `IntTuple` to search.

**Returns:**

The maximum integer value found in the tuple.

---

## tuple_min

`tuple_min(a: IntTuple[origin], b: IntTuple[origin]) -> IntTuple`

Compute the element-wise minimum of two `IntTuple`s.
This function compares corresponding elements of two `IntTuple`s and returns a new `IntTuple` containing the minimum value at each position.

Aborts: If the input tuples have different lengths.

Note: If either input contains `UNKNOWN_VALUE`, the result will be `UNKNOWN_VALUE`.

**Args:**

* ​a (`IntTuple[origin]`): First `IntTuple`.
* ​b (`IntTuple[origin]`): Second `IntTuple`.

**Returns:**

A new `IntTuple` with each element being the minimum of the corresponding elements in a and b.

---

## type

Library for graph value types.

## `AlgebraicDim` {#max.graph.type.AlgebraicDim}

> *class* max.graph.type.AlgebraicDim(value)

An algebraic tensor dimension to enable expressions over symbolic dimensions.

That is, any expression over a symbolic dimension returns `AlgebraicDim`. Furthermore, algebraic dimensions automatically simplify into a canonical form.

The following example demonstrates how to create and use algebraic dimensions with symbolic values:

```python
from max.graph import AlgebraicDim, Dim

isinstance(Dim("batch") * 5, AlgebraicDim)  # Returns True
print(Dim("batch") * 5)  # Outputs: batch * 5
-Dim("x") - 4 == -(Dim("x") + 4)  # Returns True
```

Converts valid input values to Dim.

**Parameters:**

**attr** (`ParamOperatorAttr` )

### `apply()` {#max.graph.type.AlgebraicDim.apply}

> *classmethod* apply(op, \*operands)

**Parameters:**

* **op** (`POC` )
* **operands** ([`int`](https://docs.python.org/3/library/functions.html#int) `|` [`str`](https://docs.python.org/3/library/stdtypes.html#str) `|` [`Dim`](#max.graph.type.Dim) `|` [`integer`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer) )

### `attr` {#max.graph.type.AlgebraicDim.attr}

> attr\*: ParamOperatorAttr\*

### `from_mlir()` {#max.graph.type.AlgebraicDim.from_mlir}

> *static* from\_mlir(attr)

Constructs a dimension from an `mlir.Attribute`.

**Parameters:**

* **dim\_attr** – The MLIR Attribute object to parse into a dimension.
* **attr** (`TypedAttr` )

**Returns:**

The dimension represented by the MLIR Attr value.

**Return type:**

[Dim](#max.graph.type.Dim)

### `to_mlir()` {#max.graph.type.AlgebraicDim.to_mlir}

> to\_mlir()

Creates an mlir.Attribute representing this dimension.

This is used internally when constructing tensor MLIR types.

**Returns:**

An mlir.Attribute in the context representing the dimension.

**Return type:**

*ParamOperatorAttr*

## `Dim` {#max.graph.type.Dim}

> *class* max.graph.type.Dim(value)

A tensor dimension.

Tensor dimensions can be one of three types:

* **Static**: Known size
* **Symbolic**: Unknown size but named
* **Algebraic**: Unknown size defined by an algebraic expression

In most cases, you don't need to work with a `Dim` directly. Instead, use conversion constructors:

```python
from max.graph import Dim, TensorType, DeviceRef
from max.dtype import DType

tensor_type = TensorType(DType.int64, ("batch", 10), device=DeviceRef.CPU())
```

This creates a tensor type with two dimensions:

* A symbolic "batch" dimension
* A static dimension of size 10

For explicit dimension construction, use the following helpers:

```python
from max.graph import AlgebraicDim, Dim, StaticDim, SymbolicDim

some_dims = [
    SymbolicDim("batch"),
    StaticDim(5),
    AlgebraicDim(Dim("batch") + 1),
]
```

Constraining tensor dimensions is one important way to improve model performance. If tensors have unknown dimensions, we can't optimize them as aggressively. Symbolic tensors allow the compiler to learn constraints on a specific dimension (e.g. if 2 inputs have the same batch dimension), but static dims are the easiest to optimize and therefore the easiest to create and work with.
Converts valid input values to Dim. **Parameters:** **value** (`DimLike` ) ### `from_mlir()` {#max.graph.type.Dim.from_mlir} > *static* from\_mlir(attr) Constructs a dimension from an `mlir.Attribute`. **Parameters:** * **dim\_attr** – The MLIR Attribute object to parse into a dimension. * **attr** (`TypedAttr` ) **Returns:** The dimension represented by the MLIR Attr value. **Return type:** [Dim](#max.graph.type.Dim) ### `to_mlir()` {#max.graph.type.Dim.to_mlir} > to\_mlir() Creates an `mlir.Attribute` representing this dimension. This is used internally when constructing tensor MLIR types. **Returns:** An `mlir.Attribute` in the context representing the dimension. **Return type:** *TypedAttr* ## `Shape` {#max.graph.type.Shape} > *class* max.graph.type.Shape(dims=()) **Parameters:** **dims** (`ShapeLike` ) ### `from_mlir()` {#max.graph.type.Shape.from_mlir} > *classmethod* from\_mlir(attr) **Parameters:** **attr** (`TypedAttr` ) **Return type:** [*Shape*](#max.graph.type.Shape) ### `rank` {#max.graph.type.Shape.rank} > *property* rank ### `static_dims` {#max.graph.type.Shape.static_dims} > *property* static\_dims\*: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]\* Returns all static dims in the shape as a list of integers. ### `to_mlir()` {#max.graph.type.Shape.to_mlir} > to\_mlir() **Return type:** *ShapeAttr* ## `StaticDim` {#max.graph.type.StaticDim} > *class* max.graph.type.StaticDim(value) A static tensor dimension. Static tensor dimensions will always have exactly the same value, and are key to good model performance. The following example shows how static dimensions can be created implicitly: ```python from max.graph import TensorType from max.dtype import DType tensor = TensorType(DType.int64, (4, 5)) ``` Converts valid input values to Dim. **Parameters:** **dim** ([`int`](https://docs.python.org/3/library/functions.html#int) ) ### `dim` {#max.graph.type.StaticDim.dim} > dim\*: [int](https://docs.python.org/3/library/functions.html#int)\* The size of the static dimension. ### `from_mlir()` {#max.graph.type.StaticDim.from_mlir} > *static* from\_mlir(attr) Constructs a dimension from an `mlir.Attribute`. **Parameters:** * **dim\_attr** – The MLIR Attribute object to parse into a dimension. * **attr** (`TypedAttr` ) **Returns:** The dimension represented by the MLIR Attr value. **Return type:** [*Dim*](#max.graph.type.Dim) ### `to_mlir()` {#max.graph.type.StaticDim.to_mlir} > to\_mlir() Creates an `mlir.Attribute` representing this dimension. This is used internally when constructing tensor MLIR types. **Returns:** An `mlir.Attribute` in the context representing the dimension. **Return type:** *IntegerAttr* ## `SymbolicDim` {#max.graph.type.SymbolicDim} > *class* max.graph.type.SymbolicDim(value) A symbolic tensor dimension. Symbolic dimensions represent named dimensions in MO tensor types. Symbolic dimensions don’t have a static value, but they allow a readable name to understand what’s going on in the model IR better, and they also allow users to hint to the compiler that two dimensions will have the same value, which can often allow important speedups. In tensor type notation: ```default !mo.tensor ``` The first and second dimensions are named `batch` and `x` respectively. 
Creating a `SymbolicDim`: ```python dim = SymbolicDim("name") ``` Using `SymbolicDim` in a [`TensorType`](#max.graph.type.TensorType): ```python tensor_type = TensorType(DType.bool, (SymbolicDim("batch"), SymbolicDim("x"), 10)) ``` Converts valid input values to Dim. **Parameters:** **name** ([`str`](https://docs.python.org/3/library/stdtypes.html#str) ) ### `from_mlir()` {#max.graph.type.SymbolicDim.from_mlir} > *static* from\_mlir(attr) Constructs a dimension from an `mlir.Attribute`. **Parameters:** * **dim\_attr** – The MLIR Attribute object to parse into a dimension. * **attr** (`TypedAttr` ) **Returns:** The dimension represented by the MLIR Attr value. **Return type:** [Dim](#max.graph.type.Dim) ### `name` {#max.graph.type.SymbolicDim.name} > name\*: [str](https://docs.python.org/3/library/stdtypes.html#str)\* The name of the dimension. ### `to_mlir()` {#max.graph.type.SymbolicDim.to_mlir} > to\_mlir() Creates an `mlir.Attribute` representing this dimension. This is used internally when constructing tensor MLIR types. **Returns:** An `mlir.Attribute` in the context representing the dimension. **Return type:** *ParamDeclRefAttr* ## `TensorType` {#max.graph.type.TensorType} > *class* max.graph.type.TensorType(dtype, shape, device) A symbolic [`TensorType`](#max.graph.type.TensorType). This is not an eager tensor type! This contains no actual data, but instead represents the type of a value at some point in time during model execution. Most internal values in a model will be tensors. This type represents their element type (`dtype`) and dimensions (`dims`) at a specific point during model computation. It allows us to do some optimistic optimizations and shape inference during graph construction, and to provide more detailed shape information to the compiler for further optimization passes. The following example shows how to create a tensor type with static dimensions and access its properties: ```python from max.graph import TensorType from max.dtype import DType # Create a tensor type with float32 elements and static dimensions 2x3 tensor_type = TensorType(DType.float32, (2, 3)) print(tensor_type.dtype) # Outputs: DType.float32 print(tensor_type.shape) # Outputs: [2, 3] ``` It can also represent a fully dynamic rank tensor. The presence of dynamic rank tensors in a graph will often degrade performance dramatically and prevents many classes of optimizations. An optional device (`device`) can also be provided to indicate the explicit device the tensor is associated with. Constructs a tensor type. **Parameters:** * **dtype** ([`DType`](../dtype.md#max.dtype.DType) ) – The element type of the tensor data. * **dims** – The shape dimensions of the tensor. The number of dims is the rank of the tensor. * **shape** ([`Shape`](#max.graph.type.Shape) ) * **device** (`DeviceRef` ) ### `as_buffer()` {#max.graph.type.TensorType.as_buffer} > as\_buffer() Returns the analogous buffer type. **Return type:** *BufferType* ### `from_mlir()` {#max.graph.type.TensorType.from_mlir} > *classmethod* from\_mlir(type) Constructs a tensor type from an MLIR type. **Parameters:** * **t** – The MLIR Type object to parse into a tensor type. * **type** (`TensorType` ) **Returns:** The tensor type represented by the MLIR Type value. **Return type:** [*TensorType*](#max.graph.type.TensorType) ### `to_mlir()` {#max.graph.type.TensorType.to_mlir} > to\_mlir() Converts to an `mlir.Type` instance. **Returns:** An `mlir.Type` in the specified Context. 
**Return type:**

*TensorType*

## `Type` {#max.graph.type.Type}

> *class* max.graph.type.Type

Represents any possible type for Graph values.

Every Value in the Graph has a Type, and that type is represented by a `Type`. This type may be inspected to get finer-grained types and learn more about an individual Value.

The following example shows how to work with types in a graph:

```python
from max.graph import Graph, TensorType
from max.dtype import DType

with Graph() as g:
    # Create a tensor constant with a specific type
    tensor_type = TensorType(DType.float32, [2, 3])
    # The type can be inspected to get information about the value
    print(f"Tensor element type: {tensor_type.dtype}")  # Outputs: DType.float32
    print(f"Tensor shape: {tensor_type.shape}")  # Outputs: [2, 3]
```

### `from_mlir()` {#max.graph.type.Type.from_mlir}

> *static* from\_mlir(t)

Constructs a type from an MLIR type.

**Parameters:**

**t** (`MlirType` ) – The MLIR Type object to parse into a type.

**Returns:**

The type represented by the MLIR Type value.

**Return type:**

[*Type*](#max.graph.type.Type)

### `to_mlir()` {#max.graph.type.Type.to_mlir}

> to\_mlir()

Converts to an `mlir.Type` instance.

**Returns:**

An `mlir.Type` in the specified Context.

**Return type:**

*MlirType*

---

## type_aliases

Defines some type aliases.

These are Mojo built-ins, so you don't need to import them.

## Aliases

### `AnyTrivialRegType`

`alias AnyTrivialRegType = AnyTrivialRegType`

Represents any register passable Mojo data type.

### `ImmutableAnyOrigin`

`alias ImmutableAnyOrigin = ImmutableAnyOrigin`

The immutable origin that might access any memory value.

### `ImmutableOrigin`

`alias ImmutableOrigin = ImmutableOrigin`

Immutable origin reference type.

### `MutableAnyOrigin`

`alias MutableAnyOrigin = MutableAnyOrigin`

The mutable origin that might access any memory value.

### `MutableOrigin`

`alias MutableOrigin = MutableOrigin`

Mutable origin reference type.

### `OriginSet`

`alias OriginSet = origin.set`

A set of origin parameters.

### `StaticConstantOrigin`

`alias StaticConstantOrigin = StaticConstantOrigin`

An origin for strings and other always-immutable static constants.

## Structs

* [​`Origin`](/mojo/stdlib/builtin/type_aliases/Origin): This represents an origin reference for a memory value.

---

## TypedPythonObject

`@register_passable`

`struct TypedPythonObject[type_hint: StringSlice[StaticConstantOrigin]]`

A wrapper around `PythonObject` that indicates the type of the contained object.

The PythonObject structure is entirely dynamically typed. This type provides a weak layer of optional static typing.

## Parameters

* ​type\_hint (`StringSlice[StaticConstantOrigin]`): The type name hint indicating the static type of this object.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `PythonConvertible`, `SizedRaising`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(*, owned unsafe_unchecked_from: PythonObject) -> Self`

Construct a TypedPythonObject without any validation that the given object is of the specified hinted type.

**Args:**

* ​unsafe\_unchecked\_from (`PythonObject`): The PythonObject to construct from. This will not be type checked.

`__init__(out self: TypedPythonObject[__init__[__mlir_type.!kgen.string]("Module")], name: StringSlice[StaticConstantOrigin])`

Construct a Python module with the given name.

**Args:**

* ​name (`StringSlice[StaticConstantOrigin]`): The name of the module.

**Raises:**

If the module creation fails.

### `__copyinit__`

`__copyinit__(other: Self) -> Self`

Copy an instance of this type.
**Args:** * ​other (`Self`): The value to copy. ### `__getitem__` `__getitem__[I: Indexer](self: TypedPythonObject[__init__[__mlir_type.!kgen.string]("Tuple")], pos: I) -> PythonObject` Get an element from this tuple. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​pos (`I`): The tuple element position to retrieve. **Returns:** The value of the tuple element at the specified position. ### `__len__` `__len__(self) -> Int` Returns the length of the object. **Returns:** The length of the object. ### `to_python_object` `to_python_object(self) -> PythonObject` Convert the TypedPythonObject to a PythonObject. **Returns:** A PythonObject representing the value. ### `unsafe_as_py_object_ptr` `unsafe_as_py_object_ptr(self) -> PyObjectPtr` Get the underlying PyObject pointer. Safety: Use-after-free: The caller must take care that `self` outlives the usage of the pointer returned by this function. **Returns:** The underlying PyObject pointer. --- ## TypeIdentifiable Denotes a type that can be uniquely identified. This trait is intended to be usable for implementing "type map" based functionality. This type will eventually be replaced with a generic compiler interface. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `TYPE_ID` `alias TYPE_ID` The unique identifier. --- ## types This module contains the types for the key-value cache APIs. The module includes structs implementing several different types of [KV caches](/glossary/ai/kv-cache). This module defines two traits that define the roles of the different structs * `KVCacheT`: Defines the interface for a single (key or value) cache. * `KVCollectionT`: Defines the interface for a pair of caches (keys and values). ## Structs * [​`ContinuousBatchingKVCache`](./ContinuousBatchingKVCache): Wrapper for the ContinuousKVCache of a given layer in the transformer model. * [​`ContinuousBatchingKVCacheCollection`](./ContinuousBatchingKVCacheCollection): This is a "view" of the cache for the given sequences in the batch. * [​`KVCacheStaticParams`](./KVCacheStaticParams): * [​`PagedKVCache`](./PagedKVCache): The PagedKVCache is a wrapper around the KVCache blocks for a given layer. It is used to access the KVCache blocks for PagedAttention. * [​`PagedKVCacheCollection`](./PagedKVCacheCollection): ## Traits * [​`KVCacheT`](./KVCacheT): Trait for different KVCache types and implementations. * [​`KVCollectionT`](./KVCollectionT): Trait for a pair of caches (keys and values). --- ## Types ```c #include "max/c/types.h" ``` ## Typedefs ### `M_Status` > typedef struct [M\_Status](#_CPPv48M_Status) M\_Status Contains the success or failure of an API call. In general, any API that may fail accepts a `M_Status` argument that is filled in with a meaningful error message on failure. You can create this with [`M_newStatus()`](common.md#common_8h_1adb1ef3fc2e0bcdc8eb17cac3ce91835b). When you’re done, call [`M_freeStatus()`](common.md#common_8h_1ab5067fd51a5696b3679f7f629d3329c4). ### `M_RuntimeConfig` > typedef struct [M\_RuntimeConfig](#_CPPv415M_RuntimeConfig) M\_RuntimeConfig Specifies the MAX Engine configuration. Configuration properties include the number of threads, artifact path, etc. You can create this with [`M_newRuntimeConfig()`](context.md#context_8h_1a963f1d4eefd812ba8691acf516007cfc). When you’re done, call [`M_freeRuntimeConfig()`](context.md#context_8h_1a47f7e22f7f71da9ab5fb3a1886911610). 
### `M_RuntimeContext` > typedef struct [M\_RuntimeContext](#_CPPv416M_RuntimeContext) M\_RuntimeContext Contains information that needs to be shared between APIs. You can create this with [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175). When you’re done, call [`M_freeRuntimeContext()`](context.md#context_8h_1a2434a11d8d65890c66f6b5516243a730). ### `M_UInt64Counter` > typedef struct [M\_UInt64Counter](#_CPPv415M_UInt64Counter) M\_UInt64Counter Represents custom counters created by user to be fed to the custom metrics end-point. ### `M_DoubleCounter` > typedef struct [M\_DoubleCounter](#_CPPv415M_DoubleCounter) M\_DoubleCounter ### `M_UInt64Histogram` > typedef struct [M\_UInt64Histogram](#_CPPv417M_UInt64Histogram) M\_UInt64Histogram ### `M_DoubleHistogram` > typedef struct [M\_DoubleHistogram](#_CPPv417M_DoubleHistogram) M\_DoubleHistogram ### `M_Int64Gauge` > typedef struct [M\_Int64Gauge](#_CPPv412M_Int64Gauge) M\_Int64Gauge ### `M_DoubleGauge` > typedef struct [M\_DoubleGauge](#_CPPv413M_DoubleGauge) M\_DoubleGauge ### `M_CustomMetricReader` > typedef struct [M\_CustomMetricReader](#_CPPv420M_CustomMetricReader) M\_CustomMetricReader Represents a custom metrics reader created by the user to generate custom metrics. ### `M_CompileConfig` > typedef struct [M\_CompileConfig](#_CPPv415M_CompileConfig) M\_CompileConfig Specifies the configuration required for model compilation. You can create this with [`M_newCompileConfig()`](model.md#model_8h_1a417e7a581c096ca26c36a1875163b665). When you’re done, call [`M_freeCompileConfig()`](model.md#model_8h_1abbf74b13adaf5bc8a0bb4d46c40688d9). ### `M_DeviceConfig` > typedef struct [M\_DeviceConfig](#_CPPv414M_DeviceConfig) M\_DeviceConfig ### `M_AsyncCompiledModel` > typedef struct [M\_AsyncCompiledModel](#_CPPv420M_AsyncCompiledModel) M\_AsyncCompiledModel Contains an async value to a compiled model. `M_AsyncCompiledModel` can be passed to other APIs that accept compiled models as a function parameter. This async value will eventually resolve to a compiled model or an error in the case of compilation failure. You can create this with [`M_compileModel()`](model.md#model_8h_1a88afca26a64b945885e1e1a0d09b5750). When you’re done, call [`M_freeCompiledModel()`](model.md#model_8h_1a5b6846eb4d47d445eb65c305b1c81b1c). ### `M_AsyncModel` > typedef struct [M\_AsyncModel](#_CPPv412M_AsyncModel) M\_AsyncModel Contains a future used for inference. The future will resolve to a model that’s ready for inference. You can create this with [`M_initModel()`](model.md#model_8h_1a2dcb9570ae117602579182d8faed494a). When you’re done, call [`M_freeModel()`](model.md#model_8h_1a4094fa8e414f8b6a6563474f8840d33c). ### `M_AsyncTensor` > typedef struct [M\_AsyncTensor](#_CPPv413M_AsyncTensor) M\_AsyncTensor Contains an async value to a tensor for inference. You can get this from [`M_getTensorByNameFrom()`](tensor.md#tensor_8h_1a9522ad955454dbd2d044066dea2cad95). When you’re done, call [`M_freeTensor()`](tensor.md#tensor_8h_1a339008df4a10af5e8c01ae970598765c). ### `M_TensorNameArray` > typedef struct [M\_TensorNameArray](#_CPPv417M_TensorNameArray) M\_TensorNameArray Contains an array of tensor names of model inputs or outputs. You can get this from [`M_getInputNames()`](model.md#model_8h_1a625f111600585b4a68c05d9519ff9e3c) and [`M_getOutputNames()`](model.md#model_8h_1a757f1d1f20726e3324d2a0f5683bc0f9). When you’re done, call [`M_freeTensorNameArray()`](tensor.md#tensor_8h_1a7fa5d2aff7f89143ae1905fc29b5b112). 
### `M_TensorSpec`

> typedef struct [M\_TensorSpec](#_CPPv412M_TensorSpec) M\_TensorSpec

Contains the representation of a shape and an element type.

You can create this with [`M_newTensorSpec()`](tensor.md#tensor_8h_1a964a8ab740605dbc51321702c34caeef). When you’re done, call [`M_freeTensorSpec()`](tensor.md#tensor_8h_1af0b957daeba1760134c3f24079b53026).

### `M_AsyncTensorMap`

> typedef struct [M\_AsyncTensorMap](#_CPPv416M_AsyncTensorMap) M\_AsyncTensorMap

Contains a collection of tensors. The collection of tensors is used to represent inputs and outputs when executing a model.

You can create this with [`M_newAsyncTensorMap()`](tensor.md#tensor_8h_1a18039c6e6c1769b947120b27178306eb). When you’re done, call [`M_freeAsyncTensorMap()`](tensor.md#tensor_8h_1a0ac9628dcba39c9977b7f7ff95d8781e).

### `M_TensorMapIterator`

> typedef struct [M\_TensorMapIterator](#_CPPv419M_TensorMapIterator) M\_TensorMapIterator

Contains an iterator over a collection of tensors. Note that the iteration order may not be deterministic.

### `M_AsyncValue`

> typedef struct [M\_AsyncValue](#_CPPv412M_AsyncValue) M\_AsyncValue

Contains an async value for inference.

### `M_Config`

> typedef struct [M\_Config](#_CPPv48M_Config) M\_Config

Contains a `Config`.

### `M_AsyncDict`

> typedef struct [M\_AsyncDict](#_CPPv411M_AsyncDict) M\_AsyncDict

Contains an async value to a dict.

### `M_AsyncList`

> typedef struct [M\_AsyncList](#_CPPv411M_AsyncList) M\_AsyncList

Contains an async value to a list.

### `M_AsyncTuple`

> typedef struct [M\_AsyncTuple](#_CPPv412M_AsyncTuple) M\_AsyncTuple

Contains an async value to a tuple.

### `M_AsyncNone`

> typedef struct [M\_AsyncNone](#_CPPv411M_AsyncNone) M\_AsyncNone

Contains an async value to none.

### `M_MaxContext`

> typedef struct [M\_MaxContext](#_CPPv412M_MaxContext) M\_MaxContext

Global context for MAX.

### `M_ModelSource`

> typedef struct [M\_ModelSource](#_CPPv413M_ModelSource) M\_ModelSource

Contains the source format and representation to compile a model.

### `M_WeightsRegistry`

> typedef struct [M\_WeightsRegistry](#_CPPv417M_WeightsRegistry) M\_WeightsRegistry

Maps unique weight names to their backing data.

### `M_DevicesList`

> typedef struct [M\_DevicesList](#_CPPv413M_DevicesList) M\_DevicesList

Contains a list of device pointers.

### `M_DeviceRefsList`

> typedef struct [M\_DeviceRefsList](#_CPPv416M_DeviceRefsList) M\_DeviceRefsList

Contains a list of device refs.

## Enums

### `M_Dtype`

> enum M\_Dtype

Represents all data types supported by the framework.

*Values:*

#### `M_UNKNOWN`

> enumerator M\_UNKNOWN

#### `mIsInteger`

> enumerator mIsInteger

#### `mIsFloat`

> enumerator mIsFloat

#### `mIsComplex`

> enumerator mIsComplex

#### `mIsSigned`

> enumerator mIsSigned

Bit 0 encodes “isSigned”.

#### `kIntWidthShift`

> enumerator kIntWidthShift

#### `M_INT1`

> enumerator M\_INT1

#### `M_UINT1`

> enumerator M\_UINT1

#### `M_INT2`

> enumerator M\_INT2

#### `M_UINT2`

> enumerator M\_UINT2

#### `M_INT4`

> enumerator M\_INT4

#### `M_UINT4`

> enumerator M\_UINT4

#### `M_INT8`

> enumerator M\_INT8

#### `M_UINT8`

> enumerator M\_UINT8

#### `M_INT16`

> enumerator M\_INT16

#### `M_UINT16`

> enumerator M\_UINT16

#### `M_INT32`

> enumerator M\_INT32

#### `M_UINT32`

> enumerator M\_UINT32

#### `M_INT64`

> enumerator M\_INT64

#### `M_UINT64`

> enumerator M\_UINT64

#### `M_INT128`

> enumerator M\_INT128

#### `M_UINT128`

> enumerator M\_UINT128

#### `M_FLOAT8_E3M4`

> enumerator M\_FLOAT8\_E3M4

Bits 0 through 3 indicate the kind of FP value.
#### `M_FLOAT8_E4M3`

> enumerator M\_FLOAT8\_E4M3

#### `M_FLOAT8_E4M3FN`

> enumerator M\_FLOAT8\_E4M3FN

#### `M_FLOAT8_E4M3FNUZ`

> enumerator M\_FLOAT8\_E4M3FNUZ

#### `M_FLOAT8_E5M2`

> enumerator M\_FLOAT8\_E5M2

#### `M_FLOAT8_E5M2FNUZ`

> enumerator M\_FLOAT8\_E5M2FNUZ

#### `M_FLOAT16`

> enumerator M\_FLOAT16

#### `M_BFLOAT16`

> enumerator M\_BFLOAT16

#### `M_FLOAT32`

> enumerator M\_FLOAT32

#### `M_FLOAT64`

> enumerator M\_FLOAT64

#### `M_TF32`

> enumerator M\_TF32

#### `M_BOOL`

> enumerator M\_BOOL

### `M_AllocatorType`

> enum M\_AllocatorType

Contains an `AllocatorType`. You can choose between `kCaching` and `kSystem`. `kCaching` trades off higher memory usage for better performance. `kSystem` uses the default system allocator.

*Values:*

#### `kSystem`

> enumerator kSystem

#### `kCaching`

> enumerator kCaching

### `M_ValueType`

> enum M\_ValueType

Represents the type of a value.

*Values:*

#### `M_STRING_VALUE`

> enumerator M\_STRING\_VALUE

#### `M_DOUBLE_VALUE`

> enumerator M\_DOUBLE\_VALUE

#### `M_LONG_VALUE`

> enumerator M\_LONG\_VALUE

#### `M_BOOL_VALUE`

> enumerator M\_BOOL\_VALUE

#### `M_TENSOR_VALUE`

> enumerator M\_TENSOR\_VALUE

#### `M_LIST_VALUE`

> enumerator M\_LIST\_VALUE

#### `M_TUPLE_VALUE`

> enumerator M\_TUPLE\_VALUE

#### `M_DICT_VALUE`

> enumerator M\_DICT\_VALUE

#### `M_NONE_VALUE`

> enumerator M\_NONE\_VALUE

#### `M_UNKNOWN_VALUE`

> enumerator M\_UNKNOWN\_VALUE

#### `M_MOJO_VALUE`

> enumerator M\_MOJO\_VALUE

#### `M_PYTHON_MOJO_VALUE`

> enumerator M\_PYTHON\_MOJO\_VALUE

### `M_FrameworkFormat`

> enum M\_FrameworkFormat

Represents the format.

*Values:*

#### `M_MAX_GRAPH_FRAMEWORK_FORMAT`

> enumerator M\_MAX\_GRAPH\_FRAMEWORK\_FORMAT

#### `M_TORCHSCRIPT_MODULE_FRAMEWORK_FORMAT`

> enumerator M\_TORCHSCRIPT\_MODULE\_FRAMEWORK\_FORMAT

#### `M_TORCHSCRIPT_FUNCTION_FRAMEWORK_FORMAT`

> enumerator M\_TORCHSCRIPT\_FUNCTION\_FRAMEWORK\_FORMAT

#### `M_TORCH_MLIR_FRAMEWORK_FORMAT`

> enumerator M\_TORCH\_MLIR\_FRAMEWORK\_FORMAT

### `M_ResultOutputStyle`

> enum M\_ResultOutputStyle

Represents the result output style for debug printing.

*Values:*

#### `M_COMPACT`

> enumerator M\_COMPACT

#### `M_FULL`

> enumerator M\_FULL

#### `M_BINARY`

> enumerator M\_BINARY

#### `M_BINARY_MAX_CHECKPOINT`

> enumerator M\_BINARY\_MAX\_CHECKPOINT

#### `M_NONE`

> enumerator M\_NONE

---

## Types

All values in Mojo have an associated data type. Most of the types are *nominal* types, defined by a [`struct`](/mojo/manual/structs). These types are nominal (or "named") because type equality is determined by the type's *name*, not its *structure*.

There are some types that aren't defined as structs:

* Functions are typed based on their signatures.
* `NoneType` is a type with one instance, the `None` object, which is used to signal "no value."

Mojo comes with a standard library that provides a number of useful types and utility functions. These standard types aren't privileged. Each of the standard library types is defined just like user-defined types—even basic types like [`Int`](/mojo/stdlib/builtin/int/Int) and [`String`](/mojo/stdlib/collections/string/string/String). But these standard library types are the building blocks you'll use for most Mojo programs.

The most common types are *built-in types*, which are always available and don't need to be imported. These include types for numeric values, strings, boolean values, and others.
The standard library also includes many more types that you can import as needed, including collection types, utilities for interacting with the filesystem and getting system information, and so on.

## Numeric types

Mojo's most basic numeric type is `Int`, which represents a signed integer of the largest size supported by the system—typically 64 bits or 32 bits.

Mojo also has built-in types for signed integer, unsigned integer, and floating-point values of various precisions:

| Type name | Description |
| --------- | ----------------------------------------------------- |
| `Int8` | 8-bit signed integer |
| `UInt8` | 8-bit unsigned integer |
| `Int16` | 16-bit signed integer |
| `UInt16` | 16-bit unsigned integer |
| `Int32` | 32-bit signed integer |
| `UInt32` | 32-bit unsigned integer |
| `Int64` | 64-bit signed integer |
| `UInt64` | 64-bit unsigned integer |
| `Int128` | 128-bit signed integer |
| `UInt128` | 128-bit unsigned integer |
| `Int256` | 256-bit signed integer |
| `UInt256` | 256-bit unsigned integer |
| `Float16` | 16-bit floating point number (IEEE 754-2008 binary16) |
| `Float32` | 32-bit floating point number (IEEE 754-2008 binary32) |
| `Float64` | 64-bit floating point number (IEEE 754-2008 binary64) |

Table 1. Numeric types with specific precision

The types in Table 1 are actually all aliases to a single type, [`SIMD`](/mojo/stdlib/builtin/simd/SIMD), which is discussed later.

All of the numeric types support the usual numeric and bitwise operators. The [`math`](/mojo/stdlib/math/) module provides a number of additional math functions.

You may wonder when to use `Int` and when to use the other integer types. In general, `Int` is a good safe default when you need an integer type and you don't require a specific bit width. Using `Int` as the default integer type for APIs makes APIs more consistent and predictable.

### Signed and unsigned integers

Mojo supports both signed (`Int`) and unsigned (`UInt`) integers. You can use the general `Int` or `UInt` types when you do not require a specific bit width. Note that any alias to a fixed-precision type will be of type [`SIMD`](/mojo/stdlib/builtin/simd/SIMD).

You might prefer to use unsigned integers over signed integers in conditions where you don't need negative numbers, are not writing for a public API, or need additional range.

Mojo's `UInt` type represents an unsigned integer of the [word size](https://en.wikipedia.org/wiki/Word_\(computer_architecture\)) of the CPU, which is 64 bits on 64-bit CPUs and 32 bits on 32-bit CPUs. If you wish to use a fixed size unsigned integer, you can use `UInt8`, `UInt16`, `UInt32`, or `UInt64`, which are aliases to the [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type.

Signed and unsigned integers of the same bit width can represent the same number of values, but have different ranges. For example, an `Int8` can represent 256 values ranging from -128 to 127. A `UInt8` can also represent 256 values, but represents a range of 0 to 255.

Signed and unsigned integers also have different overflow behavior. When a signed integer overflows outside the range of values that its type can represent, the value overflows to negative numbers. For example, adding `1` to `var si: Int8 = 127` results in `-128`. When an unsigned integer overflows outside the range of values that its type can represent, the value overflows to zero. So, adding `1` to `var ui: UInt8 = 255` is equal to `0`.
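The following sketch demonstrates the wrapping behavior described above (the variable names are just for illustration):

```mojo
def main():
    var si: Int8 = 127   # maximum value for a signed 8-bit integer
    si += 1
    print(si)  # -128: signed overflow wraps around to the minimum

    var ui: UInt8 = 255  # maximum value for an unsigned 8-bit integer
    ui += 1
    print(ui)  # 0: unsigned overflow wraps around to zero
```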
### Floating-point numbers

Floating-point types represent real numbers. Because not all real numbers can be expressed in a finite number of bits, floating-point numbers can't represent every value exactly.

The floating-point types listed in Table 1—`Float64`, `Float32`, and `Float16`—follow the IEEE 754-2008 standard for representing floating-point values. Each type includes a sign bit, one set of bits representing an exponent, and another set representing the fraction or mantissa. Table 2 shows how each of these types are represented in memory.

| Type name | Sign | Exponent | Mantissa |
| --------- | ----- | -------- | -------- |
| `Float64` | 1 bit | 11 bits | 52 bits |
| `Float32` | 1 bit | 8 bits | 23 bits |
| `Float16` | 1 bit | 5 bits | 10 bits |

Table 2. Details of floating-point types

Numbers with exponent values of all ones or all zeros represent special values, allowing floating-point numbers to represent infinity, negative infinity, signed zeros, and not-a-number (NaN). For more details on how numbers are represented, see [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) on Wikipedia.

A few things to note with floating-point values:

* Rounding errors. Rounding may produce unexpected results. For example, 1/3 can't be represented exactly in these floating-point formats. The more operations you perform with floating-point numbers, the more the rounding errors accumulate.
* Space between consecutive numbers. The space between consecutive numbers is variable across the range of a floating-point number format. For numbers close to zero, the distance between consecutive numbers is very small. For large positive and negative numbers, the space between consecutive numbers is greater than 1, so it may not be possible to represent consecutive integers.

Because the values are approximate, it is rarely useful to compare them with the equality operator (`==`). Consider the following example:

```mojo
var big_num = 1.0e16
var bigger_num = big_num+1.0
print(big_num == bigger_num)
```

```output
True
```

Comparison operators (`<`, `>=`, and so on) work with floating point numbers. You can also use the [`math.isclose()`](/mojo/stdlib/math/math/isclose) function to compare whether two floating-point numbers are equal within a specified tolerance.
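For example, here's a minimal sketch of comparing values with `math.isclose()` using its default tolerances:

```mojo
from math import isclose

def main():
    # Use runtime variables so the addition happens in Float64,
    # not in arbitrary-precision literal arithmetic.
    var x: Float64 = 0.1
    var y: Float64 = 0.2
    var total = x + y
    print(total == 0.3)         # False: 0.1 and 0.2 aren't exact in binary
    print(isclose(total, 0.3))  # True: equal within the default tolerance
```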
### Numeric literals

In addition to these numeric types, the standard library provides integer and floating-point literal types, [`IntLiteral`](/mojo/stdlib/builtin/int_literal/IntLiteral) and [`FloatLiteral`](/mojo/stdlib/builtin/float_literal/FloatLiteral).

These literal types are used at compile time to represent literal numbers that appear in the code. In general, you should never instantiate these types yourself.

Table 3 summarizes the literal formats you can use to represent numbers.

| Format | Examples | Notes |
| ---------------------- | --------------- | ------------------------------------------------------------------------------------------ |
| Integer literal | `1760` | Integer literal, in decimal format. |
| Hexadecimal literal | `0xaa`, `0xFF` | Integer literal, in hexadecimal format. Hex digits are case-insensitive. |
| Octal literal | `0o77` | Integer literal, in octal format. |
| Binary literal | `0b0111` | Integer literal, in binary format. |
| Floating-point literal | `3.14`, `1.2e9` | Floating-point literal. Must include the decimal point to be interpreted as floating-point. |

Table 3. Numeric literal formats

At compile time, the literal types are arbitrary-precision (also called infinite-precision) values, so the compiler can perform compile-time calculations without overflow or rounding errors. At runtime the values are converted to finite-precision types—`Int` for integer values, and `Float64` for floating-point values. (This process of converting a value that can only exist at compile time into a runtime value is called *materialization*.)

The following code sample shows the difference between an arbitrary-precision calculation and the same calculation done using `Float64` values at runtime, which suffers from rounding errors.

```mojo
var arbitrary_precision = 3.0 * (4.0 / 3.0 - 1.0)
# use a variable to force the following calculation to occur at runtime
var three = 3.0
var finite_precision = three * (4.0 / three - 1.0)
print(arbitrary_precision, finite_precision)
```

```output
1.0 0.99999999999999978
```

### `SIMD` and `DType`

To support high-performance numeric processing, Mojo uses the [`SIMD`](/mojo/stdlib/builtin/simd/SIMD) type as the basis for its numeric types. SIMD (single instruction, multiple data) is a processor technology that allows you to perform an operation on an entire set of operands at once. Mojo's `SIMD` type abstracts SIMD operations. A `SIMD` value represents a SIMD *vector*—that is, a fixed-size array of values that can fit into a processor's register. SIMD vectors are defined by two [*parameters*](/mojo/manual/parameters/):

* A `DType` value, defining the data type in the vector (for example, 32-bit floating-point numbers).
* The number of elements in the vector, which must be a power of two.

For example, you can define a vector of four `Float32` values like this:

```mojo
var vec = SIMD[DType.float32, 4](3.0, 2.0, 2.0, 1.0)
```

Math operations on SIMD values are applied *elementwise*, on each individual element in the vector. For example:

```mojo
var vec1 = SIMD[DType.int8, 4](2, 3, 5, 7)
var vec2 = SIMD[DType.int8, 4](1, 2, 3, 4)
var product = vec1 * vec2
print(product)
```

```output
[2, 6, 15, 28]
```

### Scalar values

The `SIMD` module defines several *type aliases* that are shorthand for different types of `SIMD` vectors. In particular, the `Scalar` type is just a `SIMD` vector with a single element. The numeric types listed in [Table 1](#table-1), like `Int8` and `Float32` are actually type aliases for different types of scalar values:

```mojo
alias Scalar = SIMD[size=1]
alias Int8 = Scalar[DType.int8]
alias Float32 = Scalar[DType.float32]
```

This may seem a little confusing at first, but it means that whether you're working with a single `Float32` value or a vector of float32 values, the math operations go through exactly the same code path.

#### The `DType` type

The `DType` struct describes the different data types that a `SIMD` vector can hold, and defines a number of utility functions for operating on those data types. The `DType` struct defines a set of aliases that act as identifiers for the different data types, like `DType.int8` and `DType.float32`. You use these aliases when declaring a `SIMD` vector:

```mojo
var v: SIMD[DType.float64, 16]
```

Note that `DType.float64` isn't a *type*, it's a value that describes a data type. You can't create a variable with the type `DType.float64`. You can create a variable with the type `SIMD[DType.float64, 1]` (or `Float64`, which is the same thing).
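To make the value-versus-type distinction concrete, here's a small sketch (the `zeros()` helper is hypothetical, not a standard library function) that uses a `DType` value as a compile-time parameter:

```mojo
fn zeros[dtype: DType]() -> SIMD[dtype, 4]:
    # `dtype` is a parameter *value* of type `DType`; the compiler
    # uses it to instantiate a concrete SIMD vector type.
    return SIMD[dtype, 4](0)

def main():
    print(zeros[DType.float32]())  # [0.0, 0.0, 0.0, 0.0]
    print(zeros[DType.int8]())     # [0, 0, 0, 0]
```

The next example uses some of `DType`'s utility methods to inspect a data type.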
```mojo from utils.numerics import max_finite, min_finite def describeDType[dtype: DType](): print(dtype, "is floating point:", dtype.is_floating_point()) print(dtype, "is integral:", dtype.is_integral()) print("Min/max finite values for", dtype) print(min_finite[dtype](), max_finite[dtype]()) describeDType[DType.float32]() ``` ```output float32 is floating point: True float32 is integral: False Min/max finite values for float32 -3.4028234663852886e+38 3.4028234663852886e+38 ``` There are several other data types in the standard library that also use the `DType` abstraction. ### Numeric type conversion [Constructors and implicit conversion](/mojo/manual/lifecycle/life/#constructors-and-implicit-conversion) documents the circumstances in which Mojo automatically converts a value from one type to another. Importantly, numeric [operators](/mojo/manual/operators) **don't** automatically narrow or widen operands to a common type. You can explicitly convert a `SIMD` value to a different `SIMD` type either by invoking its [`cast()`](/mojo/stdlib/builtin/simd/SIMD#cast) method or by passing it as an argument to the constructor of the target type. For example: ```mojo simd1 = SIMD[DType.float32, 4](2.2, 3.3, 4.4, 5.5) simd2 = SIMD[DType.int16, 4](-1, 2, -3, 4) simd3 = simd1 * simd2.cast[DType.float32]() # Convert with cast() method print("simd3:", simd3) simd4 = simd2 + SIMD[DType.int16, 4](simd1) # Convert with SIMD constructor print("simd4:", simd4) ``` ```output simd3: [-2.2, 6.6, -13.200001, 22.0] simd4: [1, 5, 1, 9] ``` You can convert a `Scalar` value by passing it as an argument to the constructor of the target type. For example: ```mojo var my_int: Int16 = 12 # SIMD[DType.int16, 1] var my_float: Float32 = 0.75 # SIMD[DType.float32, 1] result = Float32(my_int) * my_float # Result is SIMD[DType.float32, 1] print("Result:", result) ``` ```output Result: 9.0 ``` You can convert a scalar value of any numeric type to `Int` by passing the value to the [`Int()`](/mojo/stdlib/builtin/int/Int#__init__) constructor method. Additionally, you can pass an instance of any struct that implements the [`Intable`](/mojo/stdlib/builtin/int/Intable) trait or [`IntableRaising`](/mojo/stdlib/builtin/int/IntableRaising) trait to the `Int()` constructor to convert that instance to an `Int`. You can convert an `Int` or `IntLiteral` value to the `UInt` type by passing the value to the [`UInt()`](/mojo/stdlib/builtin/uint/UInt#__init__) constructor. You can't convert other numeric types to `UInt` directly, though you can first convert them to `Int` and then to `UInt`. ## Strings Mojo's `String` type represents a mutable string. (For Python programmers, note that this is different from Python's standard string, which is immutable.) Strings support a variety of operators and common methods. ```mojo var s: String = "Testing" s += " Mojo strings" print(s) ``` ```output Testing Mojo strings ``` Most standard library types conform to the [`Stringable`](/mojo/stdlib/builtin/str/Stringable) trait, which represents a type that can be converted to a string. 
Use `String(value)` to explicitly convert a value to a string:

```mojo
var s = String("Items in list: ") + String(5)
print(s)
```

```output
Items in list: 5
```

Or pass multiple values to the `String()` constructor, which accepts variadic `Stringable` arguments, so you don't have to call `String()` on each value:

```mojo
var s = String("Items in list: ", 5)
print(s)
```

```output
Items in list: 5
```

### String literals

As with numeric types, the standard library includes a string literal type used to represent literal strings in the program source. String literals are enclosed in either single or double quotes.

Adjacent literals are concatenated together, so you can define a long string using a series of literals broken up over several lines:

```
var s = "A very long string which is "
        "broken into two literals for legibility."
```

To define a multi-line string, enclose the literal in three single or double quotes:

```
var s = """
Multi-line string literals let you
enter long blocks of text, including
newlines."""
```

Note that the triple double quote form is also used for API documentation strings.

Unlike `IntLiteral` and `FloatLiteral`, `StringLiteral` doesn't automatically materialize to a runtime type. In some cases, you may need to explicitly convert `StringLiteral` values to `String`.

```mojo
# Variable is type `StringLiteral`
var s1 = "Example"

# Variable is type `String`
var s2: String = "Example"

# Variable is type `String`
var s3 = String("Example")
```

## Booleans

Mojo's `Bool` type represents a boolean value. It can take one of two values, `True` or `False`. You can negate a boolean value using the `not` operator.

```mojo
var conditionA = False
var conditionB: Bool
conditionB = not conditionA
print(conditionA, conditionB)
```

```output
False True
```

Many types have a boolean representation. Any type that implements the [`Boolable`](/mojo/stdlib/builtin/bool/Boolable) trait has a boolean representation. As a general principle, collections evaluate as True if they contain any elements, False if they are empty; strings evaluate as True if they have a non-zero length.

## Tuples

Mojo's `Tuple` type represents an immutable tuple consisting of zero or more values, separated by commas. Tuples can consist of multiple types and you can index into tuples in multiple ways.

```mojo
# Tuples are immutable and can hold multiple types
example_tuple = Tuple[Int, String](1, "Example")

# Assign multiple variables at once
x, y = example_tuple
print(x, y)

# Get individual values with an index
s = example_tuple[1]
print(s)
```

```output
1 Example
Example
```

You can also create a tuple without explicit typing. Note that if we declare the same tuple from the previous example with implicit typing instead of explicit, we must also convert `"Example"` from type `StringLiteral` to type `String`.

```mojo
example_tuple = (1, String("Example"))
s = example_tuple[1]
print(s)
```

```output
Example
```

When defining a function, you can explicitly declare the type of tuple elements in one of two ways:

```mojo
def return_tuple_1() -> Tuple[Int, Int]:
    return Tuple[Int, Int](1, 1)

def return_tuple_2() -> (Int, Int):
    return (2, 2)
```

## Collection types

The Mojo standard library also includes a set of basic collection types that can be used to build more complex data structures:

* [`List`](/mojo/stdlib/collections/list/List), a dynamically-sized array of items.
* [`Dict`](/mojo/stdlib/collections/dict/Dict), an associative array of key-value pairs.
* [`Set`](/mojo/stdlib/collections/set/Set), an unordered collection of unique items.
* [`Optional`](/mojo/stdlib/collections/optional/Optional) represents a value that may or may not be present.

The collection types are *generic types*: while a given collection can only hold a specific type of value (such as `Int` or `Float64`), you specify the type at compile time using a [parameter](/mojo/manual/parameters/). For example, you can create a `List` of `Int` values like this:

```mojo
var l = List[Int](1, 2, 3, 4)
# l.append(3.14) # error: FloatLiteral cannot be converted to Int
```

You don't always need to specify the type explicitly. If Mojo can *infer* the type, you can omit it. For example, when you construct a list from a set of integer literals, Mojo creates a `List[Int]`.

```mojo
# Inferred type == Int
var l1 = List(1, 2, 3, 4)
```

Where you need a more flexible collection, the [`Variant`](/mojo/stdlib/utils/variant/Variant) type can hold different types of values. For example, a `Variant[Int32, Float64]` can hold either an `Int32` *or* a `Float64` value at any given time. (Using `Variant` is not covered in this section, see the [API docs](/mojo/stdlib/utils/variant/Variant) for more information.)

The following sections give a brief introduction to the main collection types.

### List

[`List`](/mojo/stdlib/collections/list/List) is a dynamically-sized array of elements. List elements need to conform to the [`Copyable`](/mojo/stdlib/builtin/value/Copyable) and [`Movable`](/mojo/stdlib/builtin/value/Movable) traits. Most of the common standard library primitives, like `Int`, `String`, and `SIMD`, conform to these traits. You can create a `List` by passing the element type as a parameter, like this:

```mojo
var l = List[String]()
```

The `List` type supports a subset of the Python `list` API, including the ability to append to the list, pop items out of the list, and access list items using subscript notation.

```mojo
from collections import List

var list = List(2, 3, 5)
list.append(7)
list.append(11)
print("Popping last item from list: ", list.pop())
for idx in range(len(list)):
    print(list[idx], end=", ")
```

```output
Popping last item from list: 11
2, 3, 5, 7,
```

Note that the previous code sample leaves out the type parameter when creating the list. Because the list is being created with a set of `Int` values, Mojo can *infer* the type from the arguments.

* Mojo supports list and dictionary literals for collection initialization:

  ```mojo
  # List literal
  var nums: List[Int] = [2, 3, 5]
  ```

  You can also use variadic arguments for lists:

  ```mojo
  var list = List(2, 3, 5)
  ```

* You can't `print()` a list, or convert it directly into a string.

  ```mojo
  # Does not work
  print(list)
  ```

  As shown above, you can print the individual elements in a list as long as they're a [`Stringable`](/mojo/stdlib/builtin/str/Stringable) type.

* Iterating a `List` currently returns a [`Pointer`](/mojo/stdlib/memory/pointer/Pointer) to each item, not the item itself. You can access the item using the dereference operator, `[]`:

  ```mojo
  #: from collections import List
  var list = List(2, 3, 4)
  for item in list:
      print(item[], end=", ")
  ```

  ```output
  2, 3, 4,
  ```

  Subscripting into a list, however, returns the item directly—no need to dereference:

  ```mojo
  #: from collections import List
  #: var list = List[Int](2, 3, 4)
  for i in range(len(list)):
      print(list[i], end=", ")
  ```

  ```output
  2, 3, 4,
  ```

### Dict

The [`Dict`](/mojo/stdlib/collections/dict/Dict) type is an associative array that holds key-value pairs.
You can create a `Dict` by specifying the key type and value type as parameters and using dictionary literals: ```mojo # Empty dictionary var empty_dict: Dict[String, Float64] = {} # Dictionary with initial key-value pairs var values: Dict[String, Float64] = {"pi": 3.14159, "e": 2.71828} ``` You can also use the constructor syntax: ```mojo var values = Dict[String, Float64]() ``` The dictionary's key type must conform to the [`KeyElement`](/mojo/stdlib/collections/dict/KeyElement) trait, and value elements must conform to the [`Copyable`](/mojo/stdlib/builtin/value/Copyable) and [`Movable`](/mojo/stdlib/builtin/value/Movable) traits. You can insert and remove key-value pairs, update the value assigned to a key, and iterate through keys, values, or items in the dictionary. The `Dict` iterators all yield references, so you need to use the dereference operator `[]` as shown in the following example: ```mojo var d: Dict[String, Float64] = { "plasticity": 3.1, "elasticity": 1.3, "electricity": 9.7 } for item in d.items(): print(item[].key, item[].value) ``` ```output plasticity 3.1000000000000001 elasticity 1.3 electricity 9.6999999999999993 ``` ### Set The [`Set`](/mojo/stdlib/collections/set/Set) type represents a set of unique values. You can add and remove elements from the set, test whether a value exists in the set, and perform set algebra operations, like unions and intersections between two sets. Sets are generic and the element type must conform to the [`KeyElement`](/mojo/stdlib/collections/dict/KeyElement) trait. Unlike lists and dictionaries, sets do not yet support literal syntax. ```mojo from collections import Set i_like = Set("sushi", "ice cream", "tacos", "pho") you_like = Set("burgers", "tacos", "salad", "ice cream") we_like = i_like.intersection(you_like) print("We both like:") for item in we_like: print("-", item[]) ``` ```output We both like: - ice cream - tacos ``` ### Optional An [`Optional`](/mojo/stdlib/collections/optional/Optional) represents a value that may or may not be present. Like the other collection types, it is generic, and can hold any type that conforms to the [`Copyable`](/mojo/stdlib/builtin/value/Copyable) and [`Movable`](/mojo/stdlib/builtin/value/Movable) traits. ```mojo # Two ways to initialize an Optional with a value var opt1 = Optional(5) var opt2: Optional[Int] = 5 # Two ways to initialize an Optional with no value var opt3 = Optional[Int]() var opt4: Optional[Int] = None ``` An `Optional` evaluates as `True` when it holds a value, `False` otherwise. If the `Optional` holds a value, you can retrieve a reference to the value using the `value()` method. But calling `value()` on an `Optional` with no value results in undefined behavior, so you should always guard a call to `value()` inside a conditional that checks whether a value exists. ```mojo var opt: Optional[String] = String("Testing") if opt: var value_ref = opt.value() print(value_ref) ``` ```output Testing ``` Alternately, you can use the `or_else()` method, which returns the stored value if there is one, or a user-specified default value otherwise: ```mojo var custom_greeting: Optional[String] = None print(custom_greeting.or_else("Hello")) custom_greeting = String("Hi") print(custom_greeting.or_else("Hello")) ``` ```output Hello Hi ``` ## Register-passable, memory-only, and trivial types In various places in the documentation you'll see references to register-passable, memory-only, and trivial types. 
Register-passable and memory-only types are distinguished based on how they hold data: * Register-passable types are composed exclusively of fixed-size data types, which can (theoretically) be stored in a machine register. A register-passable type can include other types, as long as they are also register-passable. `Int`, `Bool`, and `SIMD`, for example, are all register-passable types. So a register-passable `struct` could include `Int` and `Bool` fields, but not a `String` field. Register-passable types are declared with the [`@register_passable`](/mojo/manual/decorators/register-passable) decorator. Register-passable types are always passed by value (that is, the values are copied). * Memory-only types consist of any types that *don't* fit the description of register-passable types. In particular, these types usually have pointers or references to dynamically-allocated memory. `String`, `List`, and `Dict` are all examples of memory-only types. Our long-term goal is to make this distinction transparent to the user, and ensure all APIs work with both register-passable and memory-only types. But right now you will see some standard library types that only work with register-passable types or only work with memory-only types. In addition to these two categories, Mojo also has "trivial" types. Conceptually a trivial type is simply a type that doesn't require any custom logic in its lifecycle methods. The bits that make up an instance of a trivial type can be copied or moved without any knowledge of what they do. Currently, trivial types are declared using the [`@register_passable(trivial)`](/mojo/manual/decorators/register-passable#register_passabletrivial) decorator. Trivial types shouldn't be limited to only register-passable types, so in the future we intend to separate trivial types from the `@register_passable` decorator. ## `AnyType` and `AnyTrivialRegType` Two other things you'll see in Mojo APIs are references to `AnyType` and `AnyTrivialRegType`. These are effectively *metatypes*, that is, types of types. * `AnyType` represents any Mojo type. Mojo treats `AnyType` as a special kind of trait, and you'll find more discussion of it on the [Traits page](/mojo/manual/traits#the-anytype-trait). * `AnyTrivialRegType` is a metatype representing any Mojo type that's marked register passable. You'll see them in signatures like this: ```mojo fn any_type_function[ValueType: AnyTrivialRegType](value: ValueType): ... ``` You can read this as `any_type_function` has an argument, `value` of type `ValueType`, where `ValueType` is a register-passable type, determined at compile time. There is still some code like this in the standard library, but it's gradually being migrated to more generic code that doesn't distinguish between register-passable and memory-only types. --- ## uint Implements the UInt class. These are Mojo built-ins, so you don't need to import them. ## Structs * [​`UInt`](/mojo/stdlib/builtin/uint/UInt): This type represents an unsigned integer. --- ## UInt `@register_passable(trivial)` `struct UInt` This type represents an unsigned integer. The size of this unsigned integer is platform-dependent. If you wish to use a fixed size unsigned integer, consider using `UInt8`, `UInt16`, `UInt32`, or `UInt64`. ## Fields * ​value (`index`): The underlying storage for the integer value. Note that it is the same type as the `Int.value` field. MLIR doesn't differentiate between signed and unsigned integers when it comes to storing them with the index dialect. 
The difference is in the operations that are performed on them, which have signed and unsigned variants.

## Implemented traits

`Absable`, `AnyType`, `Boolable`, `CeilDivable`, `Comparable`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `Hashable`, `Indexer`, `Intable`, `KeyElement`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Representable`, `Stringable`, `UnknownDestructibility`, `Writable`, `_HashableWithHasher`

## Aliases

### `BITWIDTH`

`alias BITWIDTH = __init__[::Intable](bitwidthof[::DType,__mlir_type.!kgen.target]())`

The bit width of the integer type.

### `MAX`

`alias MAX`

Returns the maximum integer value. (The generated alias expression is platform-dependent; it is derived from `BITWIDTH`.)

### `MIN`

`alias MIN = UInt(0)`

Returns the minimum value of the type.

## Methods

### `__init__`

`__init__() -> Self`

Default constructor that produces zero.

`@implicit`
`__init__(value: IntLiteral[value]) -> Self`

Construct UInt from the given IntLiteral value.

**Args:**

* value (`IntLiteral[value]`): The init value.

`@implicit`
`__init__(value: Int) -> Self`

Construct UInt from the given Int value.

**Args:**

* value (`Int`): The init value.

`__init__[T: Indexer](value: T) -> Self`

Construct UInt from the given Indexable value.

**Parameters:**

* T (`Indexer`): The type that can index into a collection or pointer.

**Args:**

* value (`T`): The init value.

### `__bool__`

`__bool__(self) -> Bool`

Convert this UInt to Bool.

**Returns:**

False Bool value if the value is equal to 0 and True otherwise.

### `__pos__`

`__pos__(self) -> Self`

Return +self.

**Returns:**

The +self value.

### `__invert__`

`__invert__(self) -> Self`

Return `~self`.

**Returns:**

The `~self` value.

### `__lt__`

`__lt__(self, rhs: Self) -> Bool`

Return whether this UInt is strictly less than another.

**Args:**

* rhs (`Self`): The other UInt to compare against.

**Returns:**

True if this UInt is less than the other UInt and False otherwise.

### `__le__`

`__le__(self, rhs: Self) -> Bool`

Compare this UInt to the RHS using LE comparison.

**Args:**

* rhs (`Self`): The other UInt to compare against.

**Returns:**

True if this UInt is less than or equal to the RHS UInt and False otherwise.

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

Compare this UInt to the RHS using EQ comparison.

**Args:**

* rhs (`Self`): The other UInt to compare against.

**Returns:**

True if this UInt is equal to the RHS UInt and False otherwise.

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

Compare this UInt to the RHS using NE comparison.

**Args:**

* rhs (`Self`): The other UInt to compare against.

**Returns:**

True if this UInt is non-equal to the RHS UInt and False otherwise.

### `__gt__`

`__gt__(self, rhs: Self) -> Bool`

Return whether this UInt is strictly greater than another.

**Args:**

* rhs (`Self`): The other UInt to compare against.

**Returns:**

True if this UInt is greater than the other UInt and False otherwise.

### `__ge__`

`__ge__(self, rhs: Self) -> Bool`

Return whether this UInt is greater than or equal to another.

**Args:**

* rhs (`Self`): The other UInt to compare against.

**Returns:**

True if this UInt is greater than or equal to the other UInt and False otherwise.

### `__add__`

`__add__(self, rhs: Self) -> Self`

Return `self + rhs`.

**Args:**

* rhs (`Self`): The value to add.

**Returns:**

`self + rhs` value.

### `__sub__`

`__sub__(self, rhs: Self) -> Self`

Return `self - rhs`.

**Args:**

* rhs (`Self`): The value to subtract.

**Returns:**

`self - rhs` value.
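To see several of these methods together, here's a minimal usage sketch (not part of the generated reference) exercising the arithmetic and comparison operators above:

```mojo
fn main():
    var a = UInt(7)
    var b = UInt(3)
    print(a + b)   # __add__: 10
    print(a - b)   # __sub__: 4
    print(a == b)  # __eq__: False
    print(a > b)   # __gt__: True
```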
### `__mul__`

`__mul__(self, rhs: Self) -> Self`

Return `self * rhs`.

**Args:**

* rhs (`Self`): The value to multiply with.

**Returns:**

`self * rhs` value.

### `__truediv__`

`__truediv__(self, rhs: Self) -> SIMD[float64, 1]`

Return the floating point division of `self` and `rhs`.

**Args:**

* rhs (`Self`): The value to divide by.

**Returns:**

`Float64(self)/Float64(rhs)` value.

### `__floordiv__`

`__floordiv__(self, rhs: Self) -> Self`

Return the division of `self` and `rhs` rounded down to the nearest integer.

**Args:**

* rhs (`Self`): The value to divide by.

**Returns:**

`floor(self/rhs)` value.

### `__mod__`

`__mod__(self, rhs: Self) -> Self`

Return the remainder of self divided by rhs.

**Args:**

* rhs (`Self`): The value to divide by.

**Returns:**

The remainder of dividing self by rhs.

### `__pow__`

`__pow__(self, exp: Self) -> Self`

Return the value raised to the power of the given exponent.

Computes the power of an integer using the Russian Peasant Method.

**Args:**

* exp (`Self`): The exponent value.

**Returns:**

The value of `self` raised to the power of `exp`.

### `__lshift__`

`__lshift__(self, rhs: Self) -> Self`

Return `self << rhs`.

**Args:**

* rhs (`Self`): The value to shift with.

**Returns:**

`self << rhs`.

### `__rshift__`

`__rshift__(self, rhs: Self) -> Self`

Return `self >> rhs`.

**Args:**

* rhs (`Self`): The value to shift with.

**Returns:**

`self >> rhs`.

### `__and__`

`__and__(self, rhs: Self) -> Self`

Return `self & rhs`.

**Args:**

* rhs (`Self`): The RHS value.

**Returns:**

`self & rhs`.

### `__or__`

`__or__(self, rhs: Self) -> Self`

Return `self | rhs`.

**Args:**

* rhs (`Self`): The RHS value.

**Returns:**

`self | rhs`.

### `__xor__`

`__xor__(self, rhs: Self) -> Self`

Return `self ^ rhs`.

**Args:**

* rhs (`Self`): The RHS value.

**Returns:**

`self ^ rhs`.

### `__radd__`

`__radd__(self, value: Self) -> Self`

Return `value + self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value + self`.

### `__rsub__`

`__rsub__(self, value: Self) -> Self`

Return `value - self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value - self`.

### `__rmul__`

`__rmul__(self, value: Self) -> Self`

Return `value * self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value * self`.

### `__rfloordiv__`

`__rfloordiv__(self, value: Self) -> Self`

Return `value // self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value // self`.

### `__rmod__`

`__rmod__(self, value: Self) -> Self`

Return `value % self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value % self`.

### `__rpow__`

`__rpow__(self, value: Self) -> Self`

Return `pow(value, self)`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`pow(value, self)`.

### `__rlshift__`

`__rlshift__(self, value: Self) -> Self`

Return `value << self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value << self`.

### `__rrshift__`

`__rrshift__(self, value: Self) -> Self`

Return `value >> self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value >> self`.

### `__rand__`

`__rand__(self, value: Self) -> Self`

Return `value & self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value & self`.

### `__ror__`

`__ror__(self, value: Self) -> Self`

Return `value | self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value | self`.

### `__rxor__`

`__rxor__(self, value: Self) -> Self`

Return `value ^ self`.

**Args:**

* value (`Self`): The other value.

**Returns:**

`value ^ self`.

### `__iadd__`

`__iadd__(mut self, rhs: Self)`

Compute `self + rhs` and save the result in self.
**Args:**

* rhs (`Self`): The RHS value.

### `__isub__`

`__isub__(mut self, rhs: Self)`

Compute `self - rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__imul__`

`__imul__(mut self, rhs: Self)`

Compute `self * rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__itruediv__`

`__itruediv__(mut self, rhs: Self)`

Compute `self / rhs`, convert to int, and save the result in self.

Since `floor(self / rhs)` is equivalent to `self // rhs`, this yields the same as `__ifloordiv__`.

**Args:**

* rhs (`Self`): The RHS value.

### `__ifloordiv__`

`__ifloordiv__(mut self, rhs: Self)`

Compute `self // rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__imod__`

`__imod__(mut self, rhs: Self)`

Compute `self % rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__ipow__`

`__ipow__(mut self, rhs: Self)`

Compute `pow(self, rhs)` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__ilshift__`

`__ilshift__(mut self, rhs: Self)`

Compute `self << rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__irshift__`

`__irshift__(mut self, rhs: Self)`

Compute `self >> rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__iand__`

`__iand__(mut self, rhs: Self)`

Compute `self & rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__ixor__`

`__ixor__(mut self, rhs: Self)`

Compute `self ^ rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__ior__`

`__ior__(mut self, rhs: Self)`

Compute `self | rhs` and save the result in self.

**Args:**

* rhs (`Self`): The RHS value.

### `__divmod__`

`__divmod__(self, rhs: Self) -> Tuple[UInt, UInt]`

Computes both the quotient and remainder using integer division.

**Args:**

* rhs (`Self`): The value to divide by.

**Returns:**

The quotient and remainder as a `Tuple(self // rhs, self % rhs)`.

### `__index__`

`__index__(self) -> index`

Convert to index.

**Returns:**

The corresponding `__mlir_type.index` value.

### `__int__`

`__int__(self) -> Int`

Gets the integral value, wrapping to a negative number on overflow.

**Returns:**

The value as an integer.

### `__abs__`

`__abs__(self) -> Self`

Return the absolute value of the UInt value.

**Returns:**

The absolute value.

### `__ceil__`

`__ceil__(self) -> Self`

Return the ceiling of the UInt value, which is itself.

**Returns:**

The UInt value itself.

### `__floor__`

`__floor__(self) -> Self`

Return the floor of the UInt value, which is itself.

**Returns:**

The UInt value itself.

### `__round__`

`__round__(self) -> Self`

Return the rounded value of the UInt value, which is itself.

**Returns:**

The UInt value itself.

`__round__(self, ndigits: Self) -> Self`

Return the rounded value of the UInt value, which is itself.

**Args:**

* ndigits (`Self`): The number of digits to round to.

**Returns:**

The UInt value itself if ndigits >= 0 else the rounded value.

### `__trunc__`

`__trunc__(self) -> Self`

Return the truncated UInt value, which is itself.

**Returns:**

The UInt value itself.

### `__ceildiv__`

`__ceildiv__(self, denominator: Self) -> Self`

Return the rounded-up result of dividing self by denominator.

**Args:**

* denominator (`Self`): The denominator.

**Returns:**

The ceiling of dividing numerator by denominator.

### `is_power_of_two`

`is_power_of_two(self) -> Bool`

Check if the integer is a (non-zero) power of two.

**Returns:**

True if the integer is a power of two, False otherwise.
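A minimal usage sketch for the numeric helpers above (assuming the built-in `divmod()` function, which returns the quotient and remainder as a tuple):

```mojo
fn main():
    # __divmod__: quotient and remainder in one call
    var result = divmod(UInt(17), UInt(5))
    print(result[0], result[1])        # 3 2
    # is_power_of_two: zero is not considered a power of two
    print(UInt(8).is_power_of_two())   # True
    print(UInt(9).is_power_of_two())   # False
```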
### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats this integer to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the Writable trait.

**Args:**

* writer (`W`): The object to write to.

### `__str__`

`__str__(self) -> String`

Convert this UInt to a string.

A small example.

```mojo
x = UInt(50)
assert_equal(String(x), "50")
```

**Returns:**

The string representation of this UInt.

### `__repr__`

`__repr__(self) -> String`

Convert this UInt to a string.

A small example.

```mojo
x = UInt(50)
assert_equal(repr(x), "UInt(50)")
```

**Returns:**

The string representation of this UInt.

### `__hash__`

`__hash__(self) -> Self`

Hash the UInt using builtin hash.

**Returns:**

A 64-bit hash value. This value is *not* suitable for cryptographic uses. Its intended usage is for data structures. See the `hash` builtin documentation for more details.

`__hash__[H: _Hasher](self, mut hasher: H)`

Updates hasher with this UInt value.

**Parameters:**

* H (`_Hasher`): The hasher type.

**Args:**

* hasher (`H`): The hasher instance.

---

## UIntSized

The `Sized` trait describes a type that has an integer length (such as a string or array).

Any type that conforms to `Sized` or [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) works with the built-in [`len()`](/mojo/stdlib/builtin/len/len) function.

The `Sized` trait requires a type to implement the `__len__()` method. For example:

```mojo
struct Foo(Sized):
    var length: Int

    fn __len__(self) -> Int:
        return self.length
```

You can pass an instance of `Foo` to the `len()` function to get its length:

```mojo
var foo = Foo(42)
print(len(foo) == 42)
```

```plaintext
True
```

**Note:** If the `__len__()` method can raise an error, use the [`SizedRaising`](/mojo/stdlib/builtin/len/SizedRaising) trait instead.

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `__len__`

`__len__(self: _Self) -> UInt`

Get the length of the type.

**Returns:**

The length of the type.

---

## ulp

`ulp[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the ULP (units of last place, also known as units of least precision) of the number.

**Constraints:**

The element type of the input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): SIMD vector input.

**Returns:**

The ULP of x.

---

## UMMAInsDescriptor

`@register_passable(trivial)`

`struct UMMAInsDescriptor[mma_kind: UMMAKind]`

Descriptor for UMMA instructions.

This struct represents a descriptor that encodes information about UMMA instructions. The descriptor contains the following bit fields:

* Sparsity (2 bits): The sparsity of the input matrices. Currently defaults to dense matrices.
* Saturate for integer types (1 bit): Whether to saturate the result for integer types. Currently not supported.
* Matrix D type (2 bits): Data type of matrix D.
* Matrix A type (3 bits): Data type of matrix A.
* Matrix B type (3 bits): Data type of matrix B.
* Negate A matrix (1 bit): Whether to negate matrix A. Currently defaults to False.
* Negate B matrix (1 bit): Whether to negate matrix B. Currently defaults to False.
* Transpose A (1 bit): Whether to transpose matrix A.
* Transpose B (1 bit): Whether to transpose matrix B.
* N, Dimension of Matrix B (6 bits): Number of columns in matrix B. 3 LSBs are unused.
* M, Dimension of Matrix A (6 bits): Number of rows in matrix A. 3 LSBs are unused.
## Parameters

* mma_kind (`UMMAKind`): The kind of UMMA instruction.

## Fields

* desc (`SIMD[uint32, 1]`): The 32-bit descriptor value that encodes UMMA instruction information. This field stores the complete descriptor with all bit fields packed into a single 32-bit integer:

  * Bits 0-1: Sparsity selector (2 bits)
  * Bit 2: Sparsity enable (1 bit)
  * Bit 3: Saturate for integer types (1 bit)
  * Bits 4-5: Matrix D type (2 bits)
  * Bit 6: Reserved (1 bit)
  * Bits 7-9: Matrix A type (3 bits)
  * Bits 10-12: Matrix B type (3 bits)
  * Bit 13: Negate A matrix (1 bit)
  * Bit 14: Negate B matrix (1 bit)
  * Bit 15: Transpose A (1 bit)
  * Bit 16: Transpose B (1 bit)
  * Bits 17-22: N, Dimension of Matrix B (6 bits)
  * Bit 23: Reserved (1 bit)
  * Bits 24-28: M, Dimension of Matrix A (5 bits)
  * Bit 29: Reserved (1 bit)
  * Bits 30-31: Maximum shift while attempting B matrix (2 bits)

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit`
`__init__(value: SIMD[uint32, 1]) -> Self`

Initialize descriptor with raw 32-bit value.

This constructor allows creating a descriptor directly from a 32-bit integer that already contains the properly formatted bit fields for the descriptor.

**Args:**

* value (`SIMD[uint32, 1]`): A 32-bit integer containing the complete descriptor bit layout.

### `create`

`static create[d_type: DType, a_type: DType, b_type: DType, output_shape: IndexList[2, element_type=uint32], /, *, transpose_a: Bool = False, transpose_b: Bool = True]() -> Self`

Create a descriptor for UMMA instructions.

This function creates a descriptor for UMMA instructions based on the provided parameters.

**Parameters:**

* d_type (`DType`): The data type of matrix D.
* a_type (`DType`): The data type of matrix A.
* b_type (`DType`): The data type of matrix B.
* output_shape (`IndexList[2, element_type=uint32]`): The shape of the output matrix.
* transpose_a (`Bool`): Whether to transpose matrix A.
* transpose_b (`Bool`): Whether to transpose matrix B.

**Returns:**

A 32-bit integer containing the complete descriptor bit layout.

---

## UMMAKind

`@register_passable(trivial)`

`struct UMMAKind`

Struct for UMMA instruction types.

This struct defines the different types of UMMA instructions that are supported by NVIDIA's Blackwell architecture.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Aliases

### `KIND_F16`

`alias KIND_F16 = UMMAKind(__init__[__mlir_type.!pop.int_literal](2))`

f16 type

### `KIND_F8F6F4`

`alias KIND_F8F6F4 = UMMAKind(__init__[__mlir_type.!pop.int_literal](3))`

f8f6f4 type

### `KIND_I8`

`alias KIND_I8 = UMMAKind(__init__[__mlir_type.!pop.int_literal](4))`

i8 type

### `KIND_TF32`

`alias KIND_TF32 = UMMAKind(__init__[__mlir_type.!pop.int_literal](0))`

tf32 type

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Check if two UMMA kinds are equal.

**Args:**

* other (`Self`): The other UMMA kind to compare with.

**Returns:**

True if the UMMA kinds are equal, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Check if two UMMA kinds are not equal.

**Args:**

* other (`Self`): The other UMMA kind to compare with.

**Returns:**

True if the UMMA kinds are not equal, False otherwise.

### `__int__`

`__int__(self) -> Int`

Convert UMMA kind to an integer value.

**Returns:**

The integer value representing the UMMA instruction type.

### `__str__`

`__str__(self) -> String`

Convert UMMA kind to a string; this can be used as the instruction qualifier.
**Returns:** The PTX qualifier representation of the UMMA kind. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Write the UMMA kind to a writer. **Parameters:** * ​W (`Writer`): The writer type that will receive the formatted output. **Args:** * ​writer (`W`): The writer to write the UMMA kind to. --- ## unfused_qkv_matmul_ragged_paged_gguf_quantized `unfused_qkv_matmul_ragged_paged_gguf_quantized[type: DType, num_heads: Int, head_dim: Int, page_size: Int, //, quantization_encoding_q: StringSlice[StaticConstantOrigin], quantization_encoding_k: StringSlice[StaticConstantOrigin], quantization_encoding_v: StringSlice[StaticConstantOrigin]](hidden_state: NDBuffer[float32, 2, origin, shape], input_row_offsets: NDBuffer[uint32, 1, origin, shape, strides], q_weight: NDBuffer[uint8, 2, origin, shape], k_weight: NDBuffer[uint8, 2, origin, shape], v_weight: NDBuffer[uint8, 2, origin, shape], kv_collection: PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size], layer_idx: SIMD[uint32, 1], output: NDBuffer[float32, 2, origin, shape], ctx: DeviceContextPtr)` Performs a quantized matmul, writing the output into a mutable PagedKVCacheCollection object. Unlike the un-quantized version (kv\_matmul\_ragged\_continuous\_batching), this implementation does not concat the q, k, and v weights together. Instead, it performs three matmuls. This allows the q, k, and v weights to have different quantization encodings. This is only supported on CPU. **Args:** * ​hidden\_state (`NDBuffer[float32, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_heads \* head\_size). * ​input\_row\_offsets (`NDBuffer[uint32, 1, origin, shape, strides]`): Tensor with shape (batch\_size + 1,) denoting the start of each sequence along the seq\_len dimension. * ​q\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​k\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​v\_weight (`NDBuffer[uint8, 2, origin, shape]`): Tensor with shape (num\_heads \* head\_size, num\_kv\_heads \* head\_size). * ​kv\_collection (`PagedKVCacheCollection[type, KVCacheStaticParams(UInt(num_heads), UInt(head_dim)), page_size]`): The Collection object storing KVCache entries. * ​layer\_idx (`SIMD[uint32, 1]`): The index of the layer being executed. Used to retrieve the KVCache for the given layer from kv\_collection. * ​output (`NDBuffer[float32, 2, origin, shape]`): Tensor with shape (sum(seq\_lens), num\_kv\_heads \* head\_size). This is the output buffer for the Q matmul. * ​ctx (`DeviceContextPtr`): The call context pointer, passed by the graph compiler. --- ## Unit `struct Unit` Time Unit used by Benchmark Report. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Aliases ### `ms` `alias ms = "ms"` Milliseconds ### `ns` `alias ns = "ns"` Nanoseconds ### `s` `alias s = "s"` Seconds --- ## UnknownDestructibility The most basic trait that all Mojo types extend by default. This trait indicates that a type has no destructor and therefore no lifetime management. It is the default for all types unless they explicitly implement `AnyType` or `ImplicitlyDestructible`. Types with this trait: * Have no `__del__` method * Do not perform any cleanup when they go out of scope * Are suitable for simple value types that don't own resources For types that need cleanup when they are destroyed, use `ImplicitlyDestructible` or `AnyType` instead. 
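For illustration, here's a minimal sketch of such a type: a plain value struct (the name `Point2D` is hypothetical) with fixed-size fields, no `__del__()` method, and no resources to release, so it needs no lifetime management:

```mojo
struct Point2D:
    # Plain fixed-size fields; nothing to clean up on destruction.
    var x: Int
    var y: Int

    fn __init__(out self, x: Int, y: Int):
        self.x = x
        self.y = y

fn main():
    var p = Point2D(3, 4)
    print(p.x, p.y)  # No cleanup runs when `p` goes out of scope
```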
---

## unlikely

`unlikely(val: Bool) -> Bool`

Provides information that the most probable value of `val` is going to be `False`. This information can be used by optimizers.

**Args:**

* val (`Bool`): The input value, which is likely to be `False` most of the time.

**Returns:**

The input value.

---

## unlink

`unlink[PathLike: PathLike](path: PathLike)`

Removes the specified file. If the path is a directory or it cannot be deleted, an error is raised. Absolute and relative paths are allowed; relative paths are resolved from the current working directory.

**Parameters:**

* PathLike (`PathLike`): A type conforming to the `os.PathLike` trait.

**Args:**

* path (`PathLike`): The path to the file.

---

## unpack_4bit_int

`unpack_4bit_int(val: SIMD[uint32, size], idx: Int) -> SIMD[uint8, 1]`

---

## unsafe

Provides utility functions for unsafe manipulation of SIMD values.

You can import these APIs from the `memory` package. For example:

```mojo
from memory import bitcast
```

## Functions

* [`bitcast`](/mojo/stdlib/memory/unsafe/bitcast): Bitcasts a SIMD value to another SIMD value.
* [`pack_bits`](/mojo/stdlib/memory/unsafe/pack_bits): Packs a SIMD vector of `bool` values into an integer.

---

## Unsafe pointers

The [`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer) type is one of several pointer types available in the standard library to indirectly reference locations in memory. You can use an `UnsafePointer` to dynamically allocate and free memory, or to point to memory allocated by some other piece of code. You can use these pointers to write code that interacts with low-level interfaces, to interface with other programming languages, or to build array-like data structures. But as the name suggests, they're inherently *unsafe*. For example, when using unsafe pointers, you're responsible for ensuring that memory gets allocated and freed correctly.

In general, you should prefer safe pointer types when possible, reserving `UnsafePointer` for those use cases where no other pointer type works. For a comparison of standard library pointer types, see [Intro to pointers](/mojo/manual/pointers/).

## Unsafe pointer basics

An `UnsafePointer` is a type that holds an address to memory. You can store and retrieve values in that memory. The `UnsafePointer` type is *generic*—it can point to any type of value, and the value type is specified as a parameter. The value pointed to by a pointer is sometimes called a *pointee*.

```mojo
from memory import UnsafePointer

# Allocate memory to hold a value
var ptr = UnsafePointer[Int].alloc(1)
# Initialize the allocated memory
ptr.init_pointee_copy(100)
```

![](../images/pointer-diagram.png#light)
![](../images/pointer-diagram-dark.png#dark)

Figure 1. Pointer and pointee

Accessing the memory—to retrieve or update a value—is called *dereferencing* the pointer. You can dereference a pointer by following the variable name with an empty pair of square brackets:

```mojo
# Update an initialized value
ptr[] += 10
# Access an initialized value
print(ptr[])
```

```output
110
```

You can also allocate memory to hold multiple values to build array-like structures. For details, see [Storing multiple values](#storing-multiple-values).

## Lifecycle of a pointer

At any given time, a pointer can be in one of several states:

- Uninitialized. Just like any variable, a variable of type `UnsafePointer` can be declared but uninitialized.

  ```mojo
  var ptr: UnsafePointer[Int]
  ```

- Null. A null pointer has an address of 0, indicating an invalid pointer.
```mojo ptr = UnsafePointer[Int]() ``` - Pointing to allocated, uninitialized memory. The `alloc()` static method returns a pointer to a newly-allocated block of memory with space for the specified number of elements of the pointee's type. ```mojo ptr = UnsafePointer[Int].alloc(1) ``` Trying to dereference a pointer to uninitialized memory results in undefined behavior. - Pointing to initialized memory. You can initialize an allocated, uninitialized pointer by moving or copying an existing value into the memory. Or you can get a pointer to an existing value by calling the constructor with the `to` keyword argument. ```mojo ptr.init_pointee_copy(value) # or ptr.init_pointee_move(value^) # or ptr = UnsafePointer(to=value) ``` Once the value is initialized, you can read or mutate it using the dereference syntax: ```mojo oldValue = ptr[] ptr[] = newValue ``` - Dangling. When you free the pointer's allocated memory, you're left with a *dangling pointer*. The address still points to its previous location, but the memory is no longer allocated to this pointer. Trying to dereference the pointer, or calling any method that would access the memory location results in undefined behavior. ```mojo ptr.free() ``` The following diagram shows the lifecycle of an `UnsafePointer`: ![](../images/pointer-lifecycle.png#light) ![](../images/pointer-lifecycle-dark.png#dark) Figure 2. Lifecycle of an UnsafePointer ### Allocating memory Use the static `alloc()` method to allocate memory. The method returns a new pointer pointing to the requested memory. You can allocate space for one or more values of the pointee's type. ```mojo ptr = UnsafePointer[Int].alloc(10) # Allocate space for 10 Int values ``` The allocated space is *uninitialized*—like a variable that's been declared but not initialized. ### Initializing the pointee To initialize allocated memory, `UnsafePointer` provides the [`init_pointee_copy()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#init_pointee_copy) and [`init_pointee_move()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#init_pointee_move) methods. For example: ```mojo ptr.init_pointee_copy(my_value) ``` To move a value into the pointer's memory location, use `init_pointee_move()`: ```mojo str_ptr.init_pointee_move(my_string^) ``` Note that to move the value, you usually need to add the transfer sigil (`^`), unless the value is a [trivial type](/mojo/manual/types#register-passable-memory-only-and-trivial-types) (like `Int`) or a newly-constructed, "owned" value: ```mojo str_ptr.init_pointee_move(String("Owned string")) ``` Alternately, you can get a pointer to an existing value by calling the `UnsafePointer` constructor with the keyword `to` argument. This is useful for getting a pointer to a value on the stack, for example. ```mojo var counter: Int = 5 ptr = UnsafePointer(to=counter) ``` Note that when calling `UnsafePointer(to=value)`, you don't need to allocate memory, since you're pointing to an existing value. ### Dereferencing pointers Use the `[]` dereference operator to access the value stored at a pointer (the "pointee"). ```mojo # Read from pointee print(ptr[]) # mutate pointee ptr[] = 0 ``` ```output 5 ``` If you've allocated space for multiple values, you can use subscript syntax to access the values, as if they were an array, like `ptr[3]`. The empty subscript `[]` has the same meaning as `[0]`. :::caution The dereference operator assumes that the memory being dereferenced is initialized. Dereferencing uninitialized memory results in undefined behavior. 
::: You cannot safely use the dereference operator on uninitialized memory, even to *initialize* a pointee. This is because assigning to a dereferenced pointer calls lifecycle methods on the existing pointee (such as the destructor, move constructor or copy constructor). ```mojo str_ptr = UnsafePointer[String].alloc(1) # str_ptr[] = "Testing" # Undefined behavior! str_ptr.init_pointee_move("Testing") str_ptr[] += " pointers" # Works now ``` ### Destroying or removing values The [`take_pointee()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#take_pointee) method moves the pointee from the memory location pointed to by `ptr`. This is a consuming move—it invokes `__moveinit__()` on the destination value. It leaves the memory location uninitialized. The [`destroy_pointee()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#destroy_pointee) method calls the destructor on the pointee, and leaves the memory location pointed to by `ptr` uninitialized. Both `take_pointee()` and `destroy_pointee()` require that the pointer is non-null, and the memory location contains a valid, initialized value of the pointee's type; otherwise the function results in undefined behavior. The [`move_pointee_into(self, dst)`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#move_pointee_into) method moves the pointee from one pointer location to another. Both pointers must be non-null. The source location must contain a valid, initialized value of the pointee's type, and is left uninitialized after the call. The destination location is assumed to be uninitialized—if it contains a valid value, that value's destructor is not run. The value from the source location is moved to the destination location as a consuming move. This function also has undefined behavior if any of its prerequisites is not met. ### Freeing memory Calling [`free()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#free) on a pointer frees the memory allocated by the pointer. It doesn't call the destructors on any values stored in the memory—you need to do that explicitly (for example, using [`destroy_pointee()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#destroy_pointee) or one of the other functions described in [Destroying or removing values](#destroying-or-removing-values)). Disposing of a pointer without freeing the associated memory can result in a memory leak—where your program keeps taking more and more memory, because not all allocated memory is being freed. On the other hand, if you have multiple copies of a pointer accessing the same memory, you need to make sure you only call `free()` on one of them. Freeing the same memory twice is also an error. After freeing a pointer's memory, you're left with a dangling pointer—its address still points to the freed memory. Any attempt to access the memory, like dereferencing the pointer results in undefined behavior. ## Storing multiple values As mentioned in [Allocating memory](#allocating-memory), you can use an `UnsafePointer` to allocate memory for multiple values. The memory is allocated as a single, contiguous block. Pointers support arithmetic: adding an integer to a pointer returns a new pointer offset by the specified number of values from the original pointer: ```mojo third_ptr = first_ptr + 2 ``` Pointers also support subtraction, as well as in-place addition and subtraction: ```mojo # Advance the pointer one element: ptr += 1 ``` ![](../images/pointer-offset.png#light) ![](../images/pointer-offset-dark.png#dark) Figure 3. 
Pointer arithmetic

For example, the following code allocates memory to store 6 `Float64` values, and initializes them all to zero.

```mojo
float_ptr = UnsafePointer[Float64].alloc(6)
for offset in range(6):
    (float_ptr + offset).init_pointee_copy(0.0)
```

Once the values are initialized, you can access them using subscript syntax:

```mojo
float_ptr[2] = 3.0
for offset in range(6):
    print(float_ptr[offset], end=", ")
```

```output
0.0, 0.0, 3.0, 0.0, 0.0, 0.0,
```

## `UnsafePointer` and origins

The `UnsafePointer` struct has an optional `origin` parameter to track the origin of the memory it points to. For pointers initialized with the `to` keyword argument, the origin is set to the origin of the pointee. For example, in the following code, `s_ptr.origin` is the same as the origin of `s`:

```mojo
s = String("Testing")
s_ptr = UnsafePointer(to=s)
```

When initializing an `UnsafePointer` in other ways, the `origin` defaults to `MutableAnyOrigin`—indicating a pointer that could reference anything in the current scope.

If you're using a pointer in the implementation of a struct, you usually don't have to worry about the origin, as long as the pointer isn't exposed outside of the struct. For example, if you implement a static array type that allocates memory in its constructor, deallocates it in its destructor, and doesn't expose the pointer outside of the struct, the default origin is fine.

But if the struct exposes a pointer to that memory, you need to set the origin appropriately. For example, the `List` type has an `unsafe_ptr()` method that returns an `UnsafePointer` to the underlying storage. In this case, the returned pointer should share the origin of the list, since the list is the logical owner of the storage. That method looks something like this:

```mojo
fn unsafe_ptr(
    ref self,
) -> UnsafePointer[
    T,
    mut = Origin(__origin_of(self)).mut,
    origin = __origin_of(self),
]:
    return self.data.origin_cast[
        mut = Origin(__origin_of(self)).mut, origin = __origin_of(self)
    ]()
```

This returns a copy of the original pointer, with the origin set to match the origin and mutability of the `self` value. A method like this is unsafe, but setting the correct origin makes it safer, since the compiler knows that the pointer is referring to data owned by the list.

## Working with foreign pointers

When exchanging data with other programming languages, you may need to construct an `UnsafePointer` from a foreign pointer. Mojo restricts creating `UnsafePointer` instances from arbitrary addresses, to avoid users accidentally creating pointers that *alias* each other (that is, two pointers that refer to the same location). However, there are specific methods you can use to get an `UnsafePointer` from a Python or C/C++ pointer.

When dealing with memory allocated elsewhere, you need to be aware of who's responsible for freeing the memory. Freeing memory allocated elsewhere can result in undefined behavior.

You also need to be aware of the format of the data stored in memory, including data types and byte order. For more information, see [Converting data: bitcasting and byte order](#converting-data-bitcasting-and-byte-order).

### Creating a Mojo pointer from a Python pointer

The `PythonObject` type defines an [`unsafe_get_as_pointer()`](/mojo/stdlib/python/object/PythonObject#unsafe_get_as_pointer) method to construct an `UnsafePointer` from a Python address.
For example, the following code creates a NumPy array and then accesses the data using a Mojo pointer:

```mojo
from python import Python
from memory import UnsafePointer

def share_array():
    np = Python.import_module("numpy")
    arr = np.array(Python.list(1, 2, 3, 4, 5, 6, 7, 8, 9))
    ptr = arr.ctypes.data.unsafe_get_as_pointer[DType.int64]()
    for i in range(9):
        print(ptr[i], end=", ")
    print()

def main():
    share_array()
```

```output
1, 2, 3, 4, 5, 6, 7, 8, 9,
```

This example uses the NumPy [`ndarray.ctypes`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ctypes.html#numpy.ndarray.ctypes) attribute to access the raw pointer to the underlying storage (`ndarray.ctypes.data`). The `unsafe_get_as_pointer()` method constructs an `UnsafePointer` to this address.

### Working with C/C++ pointers

If you call a C/C++ function that returns a pointer using the [`external_call`](/mojo/stdlib/sys/ffi/external_call) function, you can specify the return type as an `UnsafePointer`, and Mojo will handle the type conversion for you.

```mojo
from sys.ffi import external_call

def get_foreign_pointer() -> UnsafePointer[Int]:
    ptr = external_call[
        "my_c_function",    # external function name
        UnsafePointer[Int]  # return type
    ]()
    return ptr
```

## Converting data: bitcasting and byte order

Bitcasting a pointer returns a new pointer that has the same memory location, but a new data type. This can be useful if you need to access different types of data from a single area of memory. This can happen when you're reading binary files, like image files, or receiving data over the network.

The following sample processes a format that consists of chunks of data, where each chunk contains a variable number of 32-bit integers. Each chunk begins with an 8-bit integer that identifies the number of values in the chunk.

```mojo
def read_chunks(owned ptr: UnsafePointer[UInt8]) -> List[List[UInt32]]:
    chunks = List[List[UInt32]]()
    # A chunk size of 0 indicates the end of the data
    chunk_size = Int(ptr[])
    while (chunk_size > 0):
        # Skip the 1 byte chunk_size and get a pointer to the first
        # UInt32 in the chunk
        ui32_ptr = (ptr + 1).bitcast[UInt32]()
        chunk = List[UInt32](capacity=chunk_size)
        for i in range(chunk_size):
            chunk.append(ui32_ptr[i])
        chunks.append(chunk)
        # Move our pointer to the next byte after the current chunk
        ptr += (1 + 4 * chunk_size)
        # Read the size of the next chunk
        chunk_size = Int(ptr[])
    return chunks
```

When dealing with data read in from a file or from the network, you may also need to deal with byte order. Most systems use little-endian byte order (also called least-significant byte, or LSB) where the least-significant byte in a multibyte value comes first. For example, the number 1001 can be represented in hexadecimal as 0x03E9, where E9 is the least-significant byte. Represented as a 16-bit little-endian integer, the two bytes are ordered E9 03. As a 32-bit integer, it would be represented as E9 03 00 00.

Big-endian or most-significant byte (MSB) ordering is the opposite: in the 32-bit case, 00 00 03 E9. MSB ordering is frequently used in file formats and when transmitting data over the network. You can use the [`byte_swap()`](/mojo/stdlib/bit/bit/byte_swap) function to swap the byte order of a SIMD value from big-endian to little-endian or the reverse.
For example, if the method above was reading big-endian data, you'd just need to change a single line: ```mojo chunk.append(byte_swap(ui32_ptr[i])) ``` ## Working with SIMD vectors The `UnsafePointer` type includes [`load()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#load) and [`store()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#store) methods for performing aligned loads and stores of scalar values. It also has methods supporting strided load/store and gather/scatter. Strided load loads values from memory into a SIMD vector using an offset (the "stride") between successive memory addresses. This can be useful for extracting rows or columns from tabular data, or for extracting individual values from structured data. For example, consider the data for an RGB image, where each pixel is made up of three 8-bit values, for red, green, and blue. If you want to access just the red values, you can use a strided load or store. ![](../images/strided-load-storage.png#light) ![](../images/strided-load-storage-dark.png#dark) Figure 4. Strided load The following function uses the [`strided_load()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#strided_load) and [`strided_store()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#strided_store) methods to invert the red pixel values in an image, 8 values at a time. (Note that this function only handles images where the number of pixels is evenly divisible by eight.) ```mojo def invert_red_channel(ptr: UnsafePointer[UInt8], pixel_count: Int): # number of values loaded or stored at a time alias simd_width = 8 # bytes per pixel, which is also the stride size bpp = 3 for i in range(0, pixel_count * bpp, simd_width * bpp): red_values = ptr.offset(i).strided_load[width=simd_width](bpp) # Invert values and store them in their original locations ptr.offset(i).strided_store[width=simd_width](~red_values, bpp) ``` The [`gather()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#gather) and [`scatter()`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer#scatter) methods let you load or store a set of values that are stored in arbitrary locations. You do this by passing in a SIMD vector of *offsets* to the current pointer. For example, when using `gather()`, the nth value in the vector is loaded from (pointer address) + offset[n]. ## Safety Unsafe pointers are unsafe for several reasons: - Memory management is up to the user. You need to manually allocate and free memory, and/or be aware of when other APIs are allocating or freeing memory for you. - `UnsafePointer` values are *nullable*—that is, the pointer is not guaranteed to point to anything. And even when a pointer points to allocated memory, that memory may not be *initialized*. - `UnsafePointer` does have an `origin` parameter so Mojo can track the origin of the data it points to, but it also provides unsafe APIs. For example, when you do pointer arithmetic, the compiler doesn't do any bounds checking. --- ## unsafe_pointer Implement a generic unsafe pointer type. You can import these APIs from the `memory` package. For example: ```mojo from memory import UnsafePointer ``` ## Structs * [​`UnsafePointer`](/mojo/stdlib/memory/unsafe_pointer/UnsafePointer): UnsafePointer\[T] represents an indirect reference to one or more values of type T consecutively in memory, and can refer to uninitialized memory. --- ## UnsafeMaybeUninitialized `struct UnsafeMaybeUninitialized[ElementType: AnyType]` A memory location that may or may not be initialized. Note that the destructor is a no-op. 
If the memory was initialized, the caller is responsible for calling `assume_initialized_destroy` before the memory is deallocated.

Every method in this struct is unsafe, and the caller must know at all times if the memory is initialized or not. Calling a method that assumes the memory is initialized when it is not will result in undefined behavior.

## Parameters

* ElementType (`AnyType`): The type of the element to store.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Aliases

### `type`

`alias type`

The underlying MLIR array storage type for `ElementType`.

## Methods

### `__init__`

`__init__(out self)`

The memory is now considered uninitialized.

`__init__[MovableType: Movable](out self: UnsafeMaybeUninitialized[MovableType], owned value: MovableType)`

The memory is now considered initialized.

**Parameters:**

* MovableType (`Movable`): The type of the element to store.

**Args:**

* value (`MovableType`): The value to initialize the memory with.

### `__copyinit__`

`__copyinit__(out self, other: Self)`

Copy another object.

This method should never be called, as an implicit copy should not be done on memory that may be uninitialized. Trying to call this method will abort.

If you wish to perform a copy, you should manually call the method `copy_from` instead.

**Args:**

* other (`Self`): The object to copy.

### `__moveinit__`

`__moveinit__(out self, owned other: Self)`

Move another object.

This method should never be called, as implicit moves should not be done on memory that may be uninitialized. Trying to call this method will abort.

If you wish to perform a move, you should manually call the method `move_from` instead.

**Args:**

* other (`Self`): The object to move.

### `__del__`

`__del__(owned self)`

This is a no-op.

Calling this method assumes that the memory is uninitialized. If the memory was initialized, the caller should use `assume_initialized_destroy` before.

### `copy_from`

`copy_from[CopyableType: ExplicitlyCopyable](mut self: UnsafeMaybeUninitialized[CopyableType], other: UnsafeMaybeUninitialized[CopyableType])`

Copy another object.

This function assumes that the current memory is uninitialized and the other object is initialized memory.

**Parameters:**

* CopyableType (`ExplicitlyCopyable`): The type of the object to copy.

**Args:**

* other (`UnsafeMaybeUninitialized[CopyableType]`): The object to copy.

`copy_from[CopyableType: ExplicitlyCopyable](mut self: UnsafeMaybeUninitialized[CopyableType], other: CopyableType)`

Copy another object.

This function assumes that the current memory is uninitialized.

**Parameters:**

* CopyableType (`ExplicitlyCopyable`): The type of the object to copy.

**Args:**

* other (`CopyableType`): The object to copy.

### `move_from`

`move_from[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], mut other: UnsafeMaybeUninitialized[MovableType])`

Move another object.

This function assumes that the current memory is uninitialized and the other object is initialized memory.

After the function is called, the other object is considered uninitialized.

**Parameters:**

* MovableType (`Movable`): The type of the object to move.

**Args:**

* other (`UnsafeMaybeUninitialized[MovableType]`): The object to move.

`move_from[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], other: UnsafePointer[MovableType])`

Move another object.

This function assumes that the current memory is uninitialized and the other object is initialized memory.

After the function is called, the `other` object is considered uninitialized.
**Parameters:**

* MovableType (`Movable`): The type of the object to move.

**Args:**

* other (`UnsafePointer[MovableType]`): The pointer to the object to move.

### `write`

`write[MovableType: Movable](mut self: UnsafeMaybeUninitialized[MovableType], owned value: MovableType)`

Write a value into an uninitialized memory location.

Calling this method assumes that the memory is uninitialized.

**Parameters:**

* MovableType (`Movable`): The type of the element to store.

**Args:**

* value (`MovableType`): The value to write.

### `assume_initialized`

`assume_initialized(ref self) -> ref [self] ElementType`

Returns a reference to the internal value.

Calling this method assumes that the memory is initialized.

**Returns:**

A reference to the internal value.

### `unsafe_ptr`

`unsafe_ptr(self) -> UnsafePointer[ElementType]`

Get a pointer to the underlying element.

Note that this method makes no assumption about whether the memory is initialized; it can always be called.

**Returns:**

A pointer to the underlying element.

### `assume_initialized_destroy`

`assume_initialized_destroy(mut self)`

Runs the destructor of the internal value.

Calling this method assumes that the memory is initialized.

---

## UnsafePointer

`@register_passable(trivial)`

`struct UnsafePointer[type: AnyType, *, address_space: AddressSpace = AddressSpace(0), alignment: Int = _default_alignment[::AnyType](), mut: Bool = True, origin: Origin[mut] = SomeAnyOrigin]`

UnsafePointer\[T] represents an indirect reference to one or more values of type T consecutively in memory, and can refer to uninitialized memory.

Because it supports referring to uninitialized memory, it provides unsafe methods for initializing and destroying instances of T, as well as methods for accessing the values once they are initialized.

For more information see [Unsafe pointers](/mojo/manual/pointers/unsafe-pointers) in the Mojo Manual. For a comparison with other pointer types, see [Intro to pointers](/mojo/manual/pointers/).

## Parameters

* type (`AnyType`): The type the pointer points to.
* address_space (`AddressSpace`): The address space associated with the UnsafePointer allocated memory.
* alignment (`Int`): The minimum alignment of this pointer known statically.
* mut (`Bool`): Whether the origin is mutable.
* origin (`Origin[mut]`): The origin of the memory being addressed.

## Fields

* address: The underlying pointer (an MLIR pointer value).

## Implemented traits

`AnyType`, `Boolable`, `Comparable`, `Copyable`, `EqualityComparable`, `ExplicitlyCopyable`, `GreaterThanComparable`, `GreaterThanOrEqualComparable`, `ImplicitlyBoolable`, `Intable`, `LessThanComparable`, `LessThanOrEqualComparable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__() -> Self`

Create a null pointer.

`__init__(*, ref [origin, address_space] to: type) -> Self`

Constructs a Pointer from a reference to a value.

**Args:**

* to (`type`): The value to construct a pointer to.

`@implicit`
`__init__(other: UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> Self`

Exclusivity parameter cast a pointer.

**Args:**

* other (`UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]`): Pointer to cast.

`__init__(*, ref [origin] unchecked_downcast_value: PythonObject) -> UnsafePointer[type, mut=mut, origin=origin]`

Downcast a `PythonObject` known to contain a Mojo object to a pointer.
This operation is only valid if the provided Python object contains an initialized Mojo object of matching type.

**Args:**

* unchecked_downcast_value (`PythonObject`): The Python object to downcast from.

### `__bool__`

`__bool__(self) -> Bool`

Return true if the pointer is non-null.

**Returns:**

Whether the pointer is non-null.

### `__getitem__`

`__getitem__(self) -> ref [origin, address_space] type`

Return a reference to the underlying data.

**Returns:**

A reference to the value.

`__getitem__[I: Indexer, //](self, offset: I) -> ref [origin, address_space] type`

Return a reference to the underlying data, offset by the given index.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* offset (`I`): The offset index.

**Returns:**

An offset reference.

### `__lt__`

`__lt__(self, rhs: Self) -> Bool`

Returns True if this pointer represents a lower address than rhs.

**Args:**

* rhs (`Self`): The value of the other pointer.

**Returns:**

True if this pointer represents a lower address and False otherwise.

### `__le__`

`__le__(self, rhs: Self) -> Bool`

Returns True if this pointer represents a lower or equal address than rhs.

**Args:**

* rhs (`Self`): The value of the other pointer.

**Returns:**

True if this pointer represents a lower or equal address and False otherwise.

### `__eq__`

`__eq__(self, rhs: Self) -> Bool`

Returns True if the two pointers are equal.

**Args:**

* rhs (`Self`): The value of the other pointer.

**Returns:**

True if the two pointers are equal and False otherwise.

### `__ne__`

`__ne__(self, rhs: Self) -> Bool`

Returns True if the two pointers are not equal.

**Args:**

* rhs (`Self`): The value of the other pointer.

**Returns:**

True if the two pointers are not equal and False otherwise.

### `__gt__`

`__gt__(self, rhs: Self) -> Bool`

Returns True if this pointer represents a higher address than rhs.

**Args:**

* rhs (`Self`): The value of the other pointer.

**Returns:**

True if this pointer represents a higher address and False otherwise.

### `__ge__`

`__ge__(self, rhs: Self) -> Bool`

Returns True if this pointer represents a higher than or equal address than rhs.

**Args:**

* rhs (`Self`): The value of the other pointer.

**Returns:**

True if this pointer represents a higher than or equal address and False otherwise.

### `__add__`

`__add__[I: Indexer, //](self, offset: I) -> Self`

Return a pointer at an offset from the current one.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* offset (`I`): The offset index.

**Returns:**

An offset pointer.

### `__sub__`

`__sub__[I: Indexer, //](self, offset: I) -> Self`

Return a pointer at an offset from the current one.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* offset (`I`): The offset index.

**Returns:**

An offset pointer.

### `__iadd__`

`__iadd__[I: Indexer, //](mut self, offset: I)`

Add an offset to this pointer.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* offset (`I`): The offset index.

### `__isub__`

`__isub__[I: Indexer, //](mut self, offset: I)`

Subtract an offset from this pointer.

**Parameters:**

* I (`Indexer`): A type that can be used as an index.

**Args:**

* offset (`I`): The offset index.

### `copy`

`copy(self) -> Self`

Copy an existing pointer.

**Returns:**

A copy of the value.
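A minimal sketch (assumed usage, not from the reference) tying the methods above together: a null check via `__bool__()`, an offset via `__add__()`, and an address comparison via `__lt__()`:

```mojo
from memory import UnsafePointer

fn main():
    var p = UnsafePointer[Int].alloc(4)
    if p:              # __bool__: True for a non-null pointer
        print("allocated")
    var q = p + 2      # __add__: offset by two elements
    print(p < q)       # __lt__: compares addresses; prints True
    p.free()
```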
### `address_of` `static address_of(ref [address_space] arg: type) -> UnsafePointer[type, address_space=address_space, alignment=1, mut=arg_is_mut, origin=arg_is_origin]` Gets the address of the argument. **Args:** * ​arg (`type`): The value to get the address of. **Returns:** An UnsafePointer which contains the address of the argument. ### `alloc` `static alloc(count: Int) -> UnsafePointer[type, alignment=alignment, origin={}]` Allocate an array with specified or default alignment. **Args:** * ​count (`Int`): The number of elements in the array. **Returns:** The pointer to the newly allocated array. ### `offset` `offset[I: Indexer, //](self, idx: I) -> Self` Returns a new pointer shifted by the specified offset. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. **Args:** * ​idx (`I`): The offset of the new pointer. **Returns:** The newly constructed UnsafePointer. ### `__merge_with__` `__merge_with__[: Int, : Bool, : Origin[$1], //, other_type: AnyStruct[UnsafePointer[type, address_space=address_space, alignment=$0, mut=$1, origin=$2]]](self) -> UnsafePointer[type, address_space=address_space, alignment=min(alignment, alignment), mut=mut, origin=origin]` Returns a pointer merged with the specified `other_type`. **Parameters:** * ​other\_type (`AnyStruct[UnsafePointer[type, address_space=address_space, alignment=$0, mut=$1, origin=$2]]`): The type of the pointer to merge with. **Returns:** A pointer merged with the specified `other_type`. ### `__as_bool__` `__as_bool__(self) -> Bool` Return true if the pointer is non-null. **Returns:** Whether the pointer is non-null. ### `__int__` `__int__(self) -> Int` Returns the pointer address as an integer. **Returns:** The address of the pointer as an Int. ### `__str__` `__str__(self) -> String` Gets a string representation of the pointer. **Returns:** The string representation of the pointer. ### `write_to` `write_to[W: Writer](self, mut writer: W)` Formats this pointer address to the provided Writer. **Parameters:** * ​W (`Writer`): A type conforming to the Writer trait. **Args:** * ​writer (`W`): The object to write to. ### `as_noalias_ptr` `as_noalias_ptr(self) -> Self` Cast the pointer to a new pointer that is known not to locally alias any other pointer. In other words, the pointer transitively does not alias any other memory value declared in the local function context. This information is relayed to the optimizer. If the pointer does locally alias another memory value, the behavior is undefined. **Returns:** A noalias pointer. ### `load` `load[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin]) -> SIMD[dtype, width]` Loads the value the pointer points to. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Returns:** The loaded value.
`load[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, 1]) -> SIMD[dtype, width]` Loads the value the pointer points to with the given offset. **Constraints:** The width and alignment must be positive integer values. The offset must be an integer. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​offset (`SIMD[dtype, 1]`): The offset to load from. **Returns:** The loaded value. `load[I: Indexer, dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False, invariant: Bool = _default_invariant[::Bool]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: I) -> SIMD[dtype, width]` Loads the value the pointer points to with the given offset. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. * ​invariant (`Bool`): Whether the memory is load invariant. **Args:** * ​offset (`I`): The offset to load from. **Returns:** The loaded value. ### `store` `store[I: Indexer, dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: I, val: SIMD[dtype, width])` Stores a single element value at the given offset. **Constraints:** The width and alignment must be positive integer values. The offset must be an integer. **Parameters:** * ​I (`Indexer`): A type that can be used as an index. * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​offset (`I`): The offset to store to. * ​val (`SIMD[dtype, width]`): The value to store. `store[dtype: DType, offset_type: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[offset_type, 1], val: SIMD[dtype, width])` Stores a single element value at the given offset. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​offset\_type (`DType`): The data type of the offset value. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​offset (`SIMD[offset_type, 1]`): The offset to store to. * ​val (`SIMD[dtype, width]`): The value to store.
`store[dtype: DType, //, width: Int = 1, *, alignment: Int = _default_alignment[::DType,::Int](), volatile: Bool = False](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], val: SIMD[dtype, width])` Stores a single element value. **Constraints:** The width and alignment must be positive integer values. **Parameters:** * ​dtype (`DType`): The data type of SIMD vector elements. * ​width (`Int`): The size of the SIMD vector. * ​alignment (`Int`): The minimal alignment of the address. * ​volatile (`Bool`): Whether the operation is volatile or not. **Args:** * ​val (`SIMD[dtype, width]`): The value to store. ### `strided_load` `strided_load[dtype: DType, T: Intable, //, width: Int](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], stride: T) -> SIMD[dtype, width]` Performs a strided load of the SIMD vector. **Parameters:** * ​dtype (`DType`): DType of returned SIMD value. * ​T (`Intable`): The Intable type of the stride. * ​width (`Int`): The SIMD width. **Args:** * ​stride (`T`): The stride between loads. **Returns:** The vector loaded with the given stride. ### `strided_store` `strided_store[dtype: DType, T: Intable, //, width: Int = 1](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], val: SIMD[dtype, width], stride: T)` Performs a strided store of the SIMD vector. **Parameters:** * ​dtype (`DType`): DType of `val`, the SIMD value to store. * ​T (`Intable`): The Intable type of the stride. * ​width (`Int`): The SIMD width. **Args:** * ​val (`SIMD[dtype, width]`): The SIMD value to store. * ​stride (`T`): The stride between stores. ### `gather` `gather[dtype: DType, //, *, width: Int = 1, alignment: Int = _default_alignment[::DType,::Int]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, width], mask: SIMD[bool, width] = SIMD(True), default: SIMD[dtype, width] = __init__[__mlir_type.!pop.int_literal](0)) -> SIMD[dtype, width]` Gathers a SIMD vector from offsets of the current pointer. This method loads from memory addresses calculated by appropriately shifting the current pointer according to the `offset` SIMD vector, or takes from the `default` SIMD vector, depending on the values of the `mask` SIMD vector. If a mask element is `True`, the respective result element is given by the current pointer and the `offset` SIMD vector; otherwise, the result element is taken from the `default` SIMD vector. **Constraints:** The offset type must be an integral type. The alignment must be a power of two integer value. **Parameters:** * ​dtype (`DType`): DType of the return SIMD. * ​width (`Int`): The SIMD width. * ​alignment (`Int`): The minimal alignment of the address. **Args:** * ​offset (`SIMD[dtype, width]`): The SIMD vector of offsets to gather from. * ​mask (`SIMD[bool, width]`): The SIMD vector of boolean values, indicating for each element whether to load from memory or to take from the `default` SIMD vector. * ​default (`SIMD[dtype, width]`): The SIMD vector providing default values to be taken where the `mask` SIMD vector is `False`. **Returns:** The SIMD vector containing the gathered values.
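As a small illustration of the `load`/`store` family (a sketch, not from the reference itself), the following allocates scalar storage, writes one element at a time, and then reads four consecutive elements back as a single SIMD vector:

```mojo
fn main():
    # Allocate uninitialized storage for 8 Float32 scalars.
    var p = UnsafePointer[Float32].alloc(8)
    for i in range(8):
        p.store(i, Float32(i))  # store a single element at each offset
    # Load 4 consecutive elements starting at offset 2 as one SIMD vector.
    var v = p.load[width=4](2)
    print(v)  # [2.0, 3.0, 4.0, 5.0]
    p.free()
```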
### `scatter` `scatter[dtype: DType, //, *, width: Int = 1, alignment: Int = _default_alignment[::DType,::Int]()](self: UnsafePointer[SIMD[dtype, 1], address_space=address_space, alignment=alignment, mut=mut, origin=origin], offset: SIMD[dtype, width], val: SIMD[dtype, width], mask: SIMD[bool, width] = SIMD(True))` Scatters a SIMD vector into offsets of the current pointer. This method stores at memory addresses calculated by appropriately shifting the current pointer according to the `offset` SIMD vector, depending on the values of the `mask` SIMD vector. If a mask element is `True`, the respective element in the `val` SIMD vector is stored at the memory address defined by the current pointer and the `offset` SIMD vector; otherwise, no action is taken for that element in `val`. If the same offset is targeted multiple times, the values are stored in the order they appear in the `val` SIMD vector, from the first to the last element. **Constraints:** The offset type must be an integral type. The alignment must be a power of two integer value. **Parameters:** * ​dtype (`DType`): DType of `val`, the SIMD value to store. * ​width (`Int`): The SIMD width. * ​alignment (`Int`): The minimal alignment of the address. **Args:** * ​offset (`SIMD[dtype, width]`): The SIMD vector of offsets to scatter into. * ​val (`SIMD[dtype, width]`): The SIMD vector containing the values to be scattered. * ​mask (`SIMD[bool, width]`): The SIMD vector of boolean values, indicating for each element whether to store to memory or not. ### `free` `free(self: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin])` Free the memory referenced by the pointer. ### `bitcast` `bitcast[T: AnyType = type](self) -> UnsafePointer[T, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Bitcasts a UnsafePointer to a different type. **Parameters:** * ​T (`AnyType`): The target type. **Returns:** A new UnsafePointer object with the specified type and the same address as the original UnsafePointer. ### `static_alignment_cast` `static_alignment_cast[alignment: Int = alignment](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Changes the `alignment` of an `UnsafePointer`. The static alignment of an UnsafePointer must be greater than or equal to the actual alignment of the runtime pointer value. Casting an UnsafePointer to a static alignment greater than its runtime alignment may cause undefined behavior. This only changes the compile-time alignment encoded in the type of this pointer. This does not change the alignment of the pointer address at runtime. **Parameters:** * ​alignment (`Int`): Alignment of the destination pointer. **Returns:** A new UnsafePointer object with the same type, address\_space, and address as the original UnsafePointer, and the new specified alignment. ### `origin_cast` `origin_cast[mut: Bool = mut, origin: Origin[mut] = origin](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Changes the origin or mutability of a pointer. **Parameters:** * ​mut (`Bool`): Whether the origin is mutable. * ​origin (`Origin[mut]`): Origin of the destination pointer. **Returns:** A new UnsafePointer object with the same type and the same address as the original UnsafePointer, and the new specified mutability and origin.
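A brief sketch of `bitcast` (illustrative only; the byte values shown assume a little-endian machine):

```mojo
fn main():
    var p = UnsafePointer[Int32].alloc(2)
    p.store(0, Int32(258))  # 0x00000102
    # bitcast reinterprets the same address as a different pointee type;
    # no data is copied or converted.
    var bytes = p.bitcast[UInt8]()
    print(bytes[0], bytes[1])  # 2 1 on a little-endian machine
    p.free()
```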
### `address_space_cast` `address_space_cast[address_space: AddressSpace = address_space](self) -> UnsafePointer[type, address_space=address_space, alignment=alignment, mut=mut, origin=origin]` Casts an UnsafePointer to a different address space. **Parameters:** * ​address\_space (`AddressSpace`): The address space of the result. **Returns:** A new UnsafePointer object with the same type and the same address as the original UnsafePointer, and the new address space. ### `destroy_pointee` `destroy_pointee(self: UnsafePointer[type, alignment=alignment, mut=mut, origin=origin])` Destroy the pointed-to value. The pointer must not be null, and the pointer memory location is assumed to contain a valid initialized instance of `type`. This is equivalent to `_ = self.take_pointee()` but doesn't require `Movable` and is more efficient because it doesn't invoke `__moveinit__`. ### `take_pointee` `take_pointee[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]) -> T` Move the value at the pointer out, leaving it uninitialized. The pointer must not be null, and the pointer memory location is assumed to contain a valid initialized instance of `T`. This performs a *consuming* move, ending the origin of the value stored in this pointer memory location. Subsequent reads of this pointer are not valid. If a new valid value is stored using `init_pointee_move()`, then reading from this pointer becomes valid again. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Returns:** The value at the pointer. ### `init_pointee_move` `init_pointee_move[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], owned value: T)` Emplace a new value into the pointer location, moving from `value`. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location. When compared to `init_pointee_copy`, this avoids an extra copy on the caller side when the value is an `owned` rvalue. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Args:** * ​value (`T`): The value to emplace. ### `init_pointee_copy` `init_pointee_copy[T: Copyable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], value: T)` Emplace a copy of `value` into the pointer location. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location. When compared to `init_pointee_move`, this avoids an extra move on the callee side when the value must be copied. **Parameters:** * ​T (`Copyable`): The type the pointer points to, which must be `Copyable`. **Args:** * ​value (`T`): The value to emplace. ### `init_pointee_explicit_copy` `init_pointee_explicit_copy[T: ExplicitlyCopyable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], value: T)` Emplace a copy of `value` into this pointer location. The pointer memory location is assumed to contain uninitialized data, and consequently the current contents of this pointer are not destructed before writing `value`. Similarly, ownership of `value` is logically transferred into the pointer location.
When compared to `init_pointee_move`, this avoids an extra move on the callee side when the value must be copied. **Parameters:** * ​T (`ExplicitlyCopyable`): The type the pointer points to, which must be `ExplicitlyCopyable`. **Args:** * ​value (`T`): The value to emplace. ### `move_pointee_into` `move_pointee_into[T: Movable, //](self: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin], dst: UnsafePointer[T, alignment=alignment, mut=mut, origin=origin])` Moves the value `self` points to into the memory location pointed to by `dst`. This performs a consuming move (using `__moveinit__()`) out of the memory location pointed to by `self`. Subsequent reads of this pointer are not valid unless and until a new, valid value has been moved into this pointer's memory location using `init_pointee_move()`. This transfers the value out of `self` and into `dst` using at most one `__moveinit__()` call. **Safety:** * `self` must be non-null * `self` must contain a valid, initialized instance of `T` * `dst` must not be null * The contents of `dst` should be uninitialized. If `dst` was previously written with a valid value, that value will be overwritten and its destructor will NOT be run. **Parameters:** * ​T (`Movable`): The type the pointer points to, which must be `Movable`. **Args:** * ​dst (`UnsafePointer[T, alignment=alignment, mut=mut, origin=origin]`): Destination pointer that the value will be moved into. --- ## unsetenv `unsetenv(owned name: String) -> Bool` Unsets an environment variable. **Args:** * ​name (`String`): The name of the environment variable. **Returns:** True if unsetting the variable succeeded. Otherwise, False is returned. --- ## unswitch `unswitch[: origin.set, //, switched_func: fn[Bool]() raises capturing -> None](dynamic_switch: Bool)` Performs a functional unswitch transformation. Unswitch is a simple pattern that is similar in idea to the loop unswitching compiler pass, but extended to functional patterns. The pattern facilitates the following code transformation, which reduces the number of branches in the generated code. Before: ``` for i in range(...): if i < xxx: ... ``` After: ``` if i < xxx: for i in range(...): ... else: for i in range(...): ... ``` **Parameters:** * ​switched\_func (`fn[Bool]() raises capturing -> None`): The function containing the inner loop logic that can be unswitched. **Args:** * ​dynamic\_switch (`Bool`): The dynamic condition that enables the unswitched code path. `unswitch[: origin.set, //, switched_func: fn[Bool]() capturing -> None](dynamic_switch: Bool)` Performs a functional unswitch transformation. Unswitch is a simple pattern that is similar in idea to the loop unswitching compiler pass, but extended to functional patterns. The pattern facilitates the following code transformation, which reduces the number of branches in the generated code. Before: ``` for i in range(...): if i < xxx: ... ``` After: ``` if i < xxx: for i in range(...): ... else: for i in range(...): ... ``` **Parameters:** * ​switched\_func (`fn[Bool]() capturing -> None`): The function containing the inner loop logic that can be unswitched. **Args:** * ​dynamic\_switch (`Bool`): The dynamic condition that enables the unswitched code path. `unswitch[: origin.set, //, switched_func: fn[Bool, Bool]() capturing -> None](dynamic_switch_a: Bool, dynamic_switch_b: Bool)` Performs a functional 2-predicates unswitch transformation. **Parameters:** * ​switched\_func (`fn[Bool, Bool]() capturing -> None`): The function containing the inner loop logic that has 2 predicates which can be unswitched. **Args:** * ​dynamic\_switch\_a (`Bool`): The first dynamic condition that enables the outer unswitched code path. * ​dynamic\_switch\_b (`Bool`): The second dynamic condition that enables the inner unswitched code path.
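For orientation, here is a minimal sketch of the usual `unswitch` calling pattern: a `@parameter` closure whose compile-time `Bool` parameter replaces the runtime branch. The import path (`algorithm.functional`) and the summation body are assumptions for illustration, not taken from the reference above:

```mojo
from algorithm.functional import unswitch  # assumed import path

fn sum_values(values: List[Int], square: Bool) -> Int:
    var total = 0

    @parameter
    fn body[do_square: Bool]():
        for i in range(len(values)):
            # @parameter if is resolved at compile time, so each
            # specialization of `body` contains only one branch.
            @parameter
            if do_square:
                total += values[i] * values[i]
            else:
                total += values[i]

    # One runtime branch selects between two fully specialized loops,
    # instead of re-checking `square` on every iteration.
    unswitch[body](square)
    return total

fn main():
    print(sum_values(List[Int](1, 2, 3), True))  # 14
```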
--- ## upcast `upcast(layout: Layout, factor: Int) -> Layout` Fuses consecutive elements in a layout to create a coarser layout. This function is useful for converting between different data type granularities, such as from bytes to larger data types like bfloat16 or tf32. **Args:** * ​layout (`Layout`): The layout to upcast. * ​factor (`Int`): The number of consecutive elements to fuse into one. **Returns:** A new layout with adjusted shape and stride for the coarser granularity. --- ## update_frequency_data `update_frequency_data[token_type: DType, //, target: StringSlice[StaticConstantOrigin]](compressed_frequency_data: LayoutTensor[int32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], frequency_offsets: LayoutTensor[uint32, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], new_tokens: LayoutTensor[token_type, layout, origin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], ctx: DeviceContextPtr)` Update the frequency data for the given new tokens. The frequency data is stored in a CSR format. This kernel expects there will be enough padding for each sequence to store the new tokens. --- ## update_w_tile_2d `update_w_tile_2d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], _init_output: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[2], n: Int, hw: IndexList[2])` --- ## update_w_tile_3d `update_w_tile_3d[micro_kernel_height: Int, micro_kernel_width: Int, simd_size: Int, effected_by_padding: Bool, has_residual: Bool, last_c_tile: Bool, output_dt: DType, input_dt: DType, filter_dt: DType](output: UnsafePointer[SIMD[output_dt, 1]], input: UnsafePointer[SIMD[input_dt, 1]], filter: UnsafePointer[SIMD[filter_dt, 1]], _init_output: Bool, c_tile_size: Int, f_tile_offset: Int, f_tile_size: Int, conv_shape: ConvShape[3], n: Int, hw: IndexList[3])` --- ## use_apple_accelerate_lib `use_apple_accelerate_lib[c_type: DType, a_type: DType, b_type: DType]() -> Bool` --- ## use_i8mm_fn `use_i8mm_fn[a_type: DType, b_type: DType, c_type: DType]() -> Bool` --- ## use_vnni_fn `use_vnni_fn[a_type: DType, b_type: DType, c_type: DType]() -> Bool` --- ## Using AI coding assistants You can use large language models (LLMs) to accelerate your development with Modular by providing structured context about Modular Platform’s docs and code to your projects. We provide two mechanisms: - `llms.txt` files for broad documentation access. - `.cursorrules` files for specific coding guidelines. ## Supply documentation to LLMs with `llms.txt` Modular supports the [llms.txt](https://llmstxt.org/) proposed standard, enabling LLMs to access our documentation at inference time. This gives LLMs access to the most up-to-date documentation, yielding more accurate and context-aware responses. Modular provides the following `llms.txt` files: - **llms.txt**: Contains an index of links with brief content descriptions for LLMs to navigate to detailed information.
- **llms-full.txt**: Provides all detailed content in a single file, removing the need for navigation. - **llms-mojo.txt**: Includes documentation for the [Mojo standard library](/mojo/lib#standard-library), [MAX AI Kernels](/mojo/lib#max-ai-kernels-library), and [MAX library](/mojo/lib#max-library). - **llms-python.txt**: Contains [MAX Python APIs](/max/api/python/) documentation. ### Integrate `llms.txt` with AI-assisted IDEs You can leverage `llms.txt` files with IDEs that support tool calling, such as [Cursor](https://www.cursor.com/) or [Windsurf](https://windsurf.dev/), to provide context directly within your development environment. For example, when writing Mojo code, you can reference the `llms-mojo.txt` file by using `@docs.modular.com/llms-mojo.txt` in your chat window. Your IDE will then use this documentation to inform its suggestions, completions, and error corrections. ## Enhance LLM guidance with `.cursorrules` [`.cursorrules`](https://docs.cursor.com/context/rules), also known as project rules, are a powerful way to give LLMs consistent, reusable information. These rules are usually stored in a `.cursor/rules` directory right within your project, so they can be version-controlled and specifically scoped to your codebase. You can use Modular's `.cursorrules` to assist in coding tasks or when working with Modular-based projects: - **[`general_behavior_rules.mdc`](https://github.com/modular/modular/blob/main/.cursor/rules/general_behavior_rules.mdc)**: General rules for code creation. Emphasizes simplicity, thorough investigation, using existing solutions, descriptive naming, environment variables for configuration, robust error handling, documentation, assertions, virtual environments, and workspace-relative operations. - **[`git.mdc`](https://github.com/modular/modular/blob/main/.cursor/rules/git.mdc)**: Outlines best practices for using Git effectively. Includes guidance on code organization, commit strategies, branching models, and collaborative workflows. - **[`mojo.mdc`](https://github.com/modular/modular/blob/main/.cursor/rules/mojo.mdc)**: Enforces Mojo coding standards, performance optimizations, and best practices. Aims to ensure efficient and maintainable GPU-accelerated code, with guidance on code organization, memory management, and error handling. --- ## utils ## Aliases ### `elementwise_compute_lambda_type` `alias elementwise_compute_lambda_type = fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]` ### `elementwise_epilogue_type` `alias elementwise_epilogue_type = fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None` ## Structs * [​`GemmShape`](./GemmShape): Helper class to unpack gemm dimension and layout. * [​`InnerKernelID`](./InnerKernelID): * [​`KernelConfig`](./KernelConfig): Static configuration of the matmul inner kernel. * [​`MicroKernelShape`](./MicroKernelShape): Record describing the inner kernel shape. * [​`SubMatmulConfig`](./SubMatmulConfig): Static configuration of sub-matrices in parallel matmul. ## Functions * [​`apply_epilogue`](./apply_epilogue): * [​`calculate_tile_n_k`](./calculate_tile_n_k): Helper heuristic function to decide on tile size to partition the matmul given the cache size and desired data layout. * [​`dispatch_get_kernel_type`](./dispatch_get_kernel_type): * [​`get_kernel_config`](./get_kernel_config): Utility function to extract matmul configuration parameters for exported Functions. TODO: Add target dependent configuration parameters.
* [​`get_kernel_type`](./get_kernel_type): * [​`get_matmul_arch_factor`](./get_matmul_arch_factor): * [​`get_matmul_kernel_shape`](./get_matmul_kernel_shape): * [​`get_matmul_kernel_shape_ARM`](./get_matmul_kernel_shape_ARM): * [​`get_matmul_kernel_shape_x86`](./get_matmul_kernel_shape_x86): * [​`get_matmul_num_tasks`](./get_matmul_num_tasks): Compute the number of tasks for parallel matmul. The max number of tasks is typically the number of threads/cores. * [​`get_matmul_prefetch_b_distance_k`](./get_matmul_prefetch_b_distance_k): * [​`get_min_task_size`](./get_min_task_size): * [​`get_pack_data_size`](./get_pack_data_size): Utility to compute the number of elements to pack in each tile. Returns: The number of elements to pack. * [​`get_packB_unroll_factor`](./get_packB_unroll_factor): * [​`get_partitioned_matmul`](./get_partitioned_matmul): * [​`get_partitioned_matmul_mojo`](./get_partitioned_matmul_mojo): * [​`get_partitioned_matmul_mojo_shape`](./get_partitioned_matmul_mojo_shape): * [​`packA_i8mm`](./packA_i8mm): * [​`partition_work`](./partition_work): * [​`select_inner_kernel`](./select_inner_kernel): * [​`use_i8mm_fn`](./use_i8mm_fn): * [​`use_vnni_fn`](./use_vnni_fn): --- ## utils Implements the utils package. ## Modules * [​`index`](/mojo/stdlib/utils/index_/): Implements `IndexList` which is commonly used to represent N-D indices. * [​`lock`](/mojo/stdlib/utils/lock/): * [​`numerics`](/mojo/stdlib/utils/numerics/): Defines utilities to work with numeric types. * [​`static_tuple`](/mojo/stdlib/utils/static_tuple/): Implements StaticTuple, a statically-sized uniform container. * [​`variant`](/mojo/stdlib/utils/variant/): Defines a Variant type. * [​`write`](/mojo/stdlib/utils/write/): Establishes the contract between `Writer` and `Writable` types. --- ## utils_gpu ## Structs * [​`MatmulConfig`](./MatmulConfig): Static configuration of GPU matmul. * [​`MatmulKernels`](./MatmulKernels): Supported matmul kernels. ## Functions * [​`block_swizzle`](./block_swizzle): * [​`get_config_from_shape`](./get_config_from_shape): * [​`select_config`](./select_config): --- ## valid_length_managed_tensor_slice_to_ndbuffer `valid_length_managed_tensor_slice_to_ndbuffer(tensor: ManagedTensorSlice[io_spec, static_spec=static_spec]) -> NDBuffer[uint32, 1, MutableAnyOrigin]` --- ## value Defines core value traits. These are Mojo built-ins, so you don't need to import them. ## Traits * [​`Copyable`](/mojo/stdlib/builtin/value/Copyable): The Copyable trait denotes a type whose value can be copied. * [​`Defaultable`](/mojo/stdlib/builtin/value/Defaultable): The `Defaultable` trait describes a type with a default constructor. * [​`ExplicitlyCopyable`](/mojo/stdlib/builtin/value/ExplicitlyCopyable): The ExplicitlyCopyable trait denotes a type whose value can be copied explicitly. * [​`Movable`](/mojo/stdlib/builtin/value/Movable): The Movable trait denotes a type whose value can be moved. --- ## Value ```c #include "max/c/value.h" ``` ## Functions ### `M_getValueByNameFrom()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_getValueByNameFrom([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*valueMap, const char \*valueName, [M\_Status](types.md#_CPPv48M_Status) \*status) Gets a value from the value map by name. * **Parameters:** * **valueMap** – The value map.
* **valueName** – The name of the value. * **status** – The status object for reporting errors. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is simply borrowed from the corresponding input `M_AsyncTensorMap`. If the value map or name are invalid, a `NULL` pointer is returned and the `status` parameter contains an error message. ### `M_getValueFromMapIterator()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_getValueFromMapIterator([M\_TensorMapIterator](types.md#_CPPv419M_TensorMapIterator) \*iterator) Gets the tensor from the tensor map iterator. * **Parameters:** **iterator** – The tensor map iterator. * **Returns:** A pointer to the tensor. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTensor()`](tensor.md#tensor_8h_1a339008df4a10af5e8c01ae970598765c). The held tensor inside the return value is simply borrowed from the corresponding input `M_AsyncTensorMap`. If the tensor map iterator is invalid, a `NULL` pointer is returned. ### `M_freeValue()` > void M\_freeValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Deallocates the memory for the container. No-op if `value` is `NULL`. * **Parameters:** **value** – The value to deallocate. ### `M_getStringFromValue()` > const char \*M\_getStringFromValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Gets a string from the async value. * **Parameters:** **value** – The async value. * **Returns:** A null-terminated string if the `value` is valid. Otherwise, `NULL`. The memory associated with the returned string is owned by the `value`. ### `M_createStringAsyncValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createStringAsyncValue(const char \*data, [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates a string wrapped in an `AsyncValue`. * **Parameters:** * **data** – The zero-terminated string data. * **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. ### `M_getDoubleFromValue()` > double M\_getDoubleFromValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Gets a double from the async value. * **Parameters:** **value** – The async value. * **Returns:** A double value. ### `M_createDoubleAsyncValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createDoubleAsyncValue(double value, [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates a double value wrapped in an `AsyncValue`. * **Parameters:** * **value** – The double value. * **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. ### `M_getLongFromValue()` > int64\_t M\_getLongFromValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Gets a long from the async value. * **Parameters:** **value** – The async value. * **Returns:** A long value. 
### `M_createLongAsyncValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createLongAsyncValue(int64\_t value, [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates a long value wrapped in an `AsyncValue`. * **Parameters:** * **value** – The long value. * **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. ### `M_getBoolFromValue()` > bool M\_getBoolFromValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Gets a boolean from the async value. * **Parameters:** **value** – The async value. * **Returns:** A boolean value. ### `M_createBoolAsyncValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createBoolAsyncValue(bool value, [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates a boolean value wrapped in an `AsyncValue`. * **Parameters:** * **value** – The boolean value. * **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. ### `M_borrowValueInto()` > void M\_borrowValueInto([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensors, const char \*name, const [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value, [M\_Status](types.md#_CPPv48M_Status) \*status) Adds a value to the tensor map. You are responsible for the lifetime of the input value. It gets “borrowed” into the `TensorMap`. * **Parameters:** * **tensors** – The tensor map, from [`M_newAsyncTensorMap()`](tensor.md#tensor_8h_1a18039c6e6c1769b947120b27178306eb). * **name** – The zero-terminated string data, representing the name of the value. * **value** – The input value. * **status** – The status object for reporting errors. ### `M_getValueType()` > [M\_ValueType](types.md#_CPPv411M_ValueType) M\_getValueType([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Returns the type contained in the underlying value. * **Parameters:** **value** – The async value. * **Returns:** An enum describing the type of the underlying value. Returns `M_UNKNOWN_VALUE` for unsupported values and if the value is invalid. ### `M_getDictFromValue()` > [M\_AsyncDict](types.md#_CPPv411M_AsyncDict) \*M\_getDictFromValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Gets a `Dict` from the async value. * **Parameters:** **value** – The async value. * **Returns:** A pointer to the `Dict`. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeDict()`](#value_8h_1a4578bec6c4257a48ecc05ef358c464a5). The held `Dict` inside the return value is simply borrowed from the `M_AsyncValue`. If the value is invalid or not a `Dict`, a `NULL` pointer is returned. ### `M_createDictAsyncValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createDictAsyncValue([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates an empty `Dict` wrapped in an `AsyncValue`. * **Parameters:** **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. 
The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. ### `M_insertIntoDict()` > void M\_insertIntoDict([M\_AsyncDict](types.md#_CPPv411M_AsyncDict) \*dict, [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*key, [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Inserts a key-value pair into the `Dict`. You are responsible for the lifetime of the key and value. Their data gets “borrowed” into the `Dict`. No-op if the dict, key, or value is invalid. * **Parameters:** * **dict** – The dict to insert into. * **key** – The key to insert. * **value** – The value to insert. ### `M_getListFromValue()` > [M\_AsyncList](types.md#_CPPv411M_AsyncList) \*M\_getListFromValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Gets a `List` from the async value. * **Parameters:** **value** – The async value. * **Returns:** A pointer to the `List`. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeList()`](#value_8h_1a653e01f359ce0579b4bb7e7b6a0c286c). The held `List` inside the return value is simply borrowed from the `M_AsyncValue`. If the value is invalid or not a `List`, a `NULL` pointer is returned. ### `M_createListAsyncValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createListAsyncValue([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates an empty `List` wrapped in an `AsyncValue`. * **Parameters:** **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. ### `M_appendToList()` > void M\_appendToList([M\_AsyncList](types.md#_CPPv411M_AsyncList) \*list, [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Appends a value to the `List`. You are responsible for the lifetime of the value. Its data gets “borrowed” into the `List`. No-op if either the list or value is invalid. * **Parameters:** * **list** – The list to append onto. * **value** – The value to append. ### `M_getTupleFromValue()` > [M\_AsyncTuple](types.md#_CPPv412M_AsyncTuple) \*M\_getTupleFromValue([M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Gets a `Tuple` from the async value. * **Parameters:** **value** – The async value. * **Returns:** A pointer to the `Tuple`. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTuple()`](#value_8h_1a8bb2dfb3040465617541d2819e3b3e46). The held `Tuple` inside the return value is simply borrowed from the `M_AsyncValue`. If the value is invalid or not a `Tuple`, a `NULL` pointer is returned. ### `M_borrowIntoTuple()` > void M\_borrowIntoTuple([M\_AsyncTuple](types.md#_CPPv412M_AsyncTuple) \*tuple, [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*value) Adds a value to the `Tuple`. You are responsible for the lifetime of the value. Its data gets “borrowed” into the `Tuple`. No-op if either the tuple or value is invalid. * **Parameters:** * **tuple** – The tuple to add into. * **value** – The value to add.
### `M_createTupleAsyncValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createTupleAsyncValue([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates an empty `Tuple` wrapped in an `AsyncValue`. * **Parameters:** **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. ### `M_getDictSize()` > size\_t M\_getDictSize([M\_AsyncDict](types.md#_CPPv411M_AsyncDict) \*dict) Returns the number of elements in the `Dict`. * **Parameters:** **dict** – The dict. * **Returns:** The number of elements in the `Dict`. Returns 0 if the dict is invalid. ### `M_getListSize()` > size\_t M\_getListSize([M\_AsyncList](types.md#_CPPv411M_AsyncList) \*list) Returns the number of elements in the `List`. * **Parameters:** **list** – The list. * **Returns:** The number of elements in the `List`. Returns 0 if the list is invalid. ### `M_getTupleSize()` > size\_t M\_getTupleSize([M\_AsyncTuple](types.md#_CPPv412M_AsyncTuple) \*tuple) Returns the number of elements in the `Tuple`. * **Parameters:** **tuple** – The tuple. * **Returns:** The number of elements in the `Tuple`. Returns 0 if the tuple is invalid. ### `M_getDictKey()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_getDictKey([M\_AsyncDict](types.md#_CPPv411M_AsyncDict) \*dict, size\_t i) Returns the dict key at position `i`. * **Parameters:** * **dict** – The dict. * **i** – The index to return. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. If the dict is invalid or the index out of bounds, a `NULL` pointer is returned. ### `M_getDictValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_getDictValue([M\_AsyncDict](types.md#_CPPv411M_AsyncDict) \*dict, size\_t i) Returns the dict value at position `i`. * **Parameters:** * **dict** – The dict. * **i** – The index to return. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. If the dict is invalid or the index out of bounds, a `NULL` pointer is returned. ### `M_getListValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_getListValue([M\_AsyncList](types.md#_CPPv411M_AsyncList) \*list, size\_t i) Returns the list value at position `i`. * **Parameters:** * **list** – The list. * **i** – The index to return. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. If the list is invalid or the index out of bounds, a `NULL` pointer is returned. ### `M_getTupleValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_getTupleValue([M\_AsyncTuple](types.md#_CPPv412M_AsyncTuple) \*tuple, size\_t i) Returns the tuple value at position `i`. * **Parameters:** * **tuple** – The tuple. 
* **i** – The index to return. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. If the tuple is invalid or the index out of bounds, a `NULL` pointer is returned. ### `M_createNoneAsyncValue()` > [M\_AsyncValue](types.md#_CPPv412M_AsyncValue) \*M\_createNoneAsyncValue([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context) Creates a `None` value wrapped in an `AsyncValue`. * **Parameters:** **context** – The runtime context. * **Returns:** A pointer to the value. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeValue()`](#value_8h_1a9f8e4b2be9e0d7877da6f88919b3e96e). The held value inside the return value is owned by the `AsyncValue`. ### `M_freeDict()` > void M\_freeDict([M\_AsyncDict](types.md#_CPPv411M_AsyncDict) \*dict) Deallocates the memory for the dictionary. No-op if `dict` is `NULL`. * **Parameters:** **dict** – The dictionary to deallocate. ### `M_freeList()` > void M\_freeList([M\_AsyncList](types.md#_CPPv411M_AsyncList) \*list) Deallocates the memory for the list. No-op if `list` is `NULL`. * **Parameters:** **list** – The list to deallocate. ### `M_freeTuple()` > void M\_freeTuple([M\_AsyncTuple](types.md#_CPPv412M_AsyncTuple) \*list) Deallocates the memory for the tuple. No-op if `tuple` is `NULL`. * **Parameters:** **list** – The tuple to deallocate. ### `M_freeNone()` > void M\_freeNone([M\_AsyncNone](types.md#_CPPv411M_AsyncNone) \*none) Deallocates the memory for the none value. No-op if `none` is `NULL`. * **Parameters:** **none** – The async none to deallocate. --- ## Value ## `Value` {#max.graph.Value} > *class* max.graph.Value Represents a symbolic value within a Graph. A Value can represent the output of a node, the arguments of a Graph (as seen from within its body), and more generally any symbolic value available within the Graph. Other nodes receive Value values as inputs to form a computation graph. A Value may also refer to an existing input or output of a node, and you can change it, for example by swapping in a new Value. Conceptually, think of a Value as an edge in the dataflow graph, with the other end being the user of that value. The following example shows how to work with Values in a graph to create a simple computation: ```python from max.graph import DeviceRef, Graph, ops, Value from max.dtype import DType import numpy as np with Graph("value_example") as graph: # Create input values a = ops.constant(np.array([1, 2, 3]), dtype=DType.float32, device=DeviceRef.CPU()) b = ops.constant(np.array([4, 5, 6]), dtype=DType.float32, device=DeviceRef.CPU()) # Use values to perform operations c = a + b # c is a Value representing the addition # Demonstrate that the result is a Value print(f"Type of c: {type(c)}") print(f"Is c a Value? {isinstance(c, Value)}") ``` Similar to a regular variable, a Value has a data type. Value is abstract; it shouldn’t be constructed directly. ### `buffer` {#max.graph.Value.buffer} > *property* buffer\*: [BufferValue](BufferValue.md#max.graph.BufferValue)\* Returns the Value as a [`BufferValue`](BufferValue.md#max.graph.BufferValue). Raises an exception if the Value is not a BufferValue.
### `from_mlir()` {#max.graph.Value.from_mlir} > *classmethod* from\_mlir(value: Value\[TensorType]) → [TensorValue](TensorValue.md#max.graph.TensorValue) > *classmethod* from\_mlir(value: Value\[BufferType]) → [BufferValue](BufferValue.md#max.graph.BufferValue) > *classmethod* from\_mlir(value: Value\[OpaqueType]) → \_OpaqueValue > *classmethod* from\_mlir(value: Value\[ChainType]) → \_ChainValue ### `opaque` {#max.graph.Value.opaque} > *property* opaque\*: \_OpaqueValue\* Returns the Value as an `_OpaqueValue`. Raises an exception if the Value is not a \_OpaqueValue. ### `tensor` {#max.graph.Value.tensor} > *property* tensor\*: [TensorValue](TensorValue.md#max.graph.TensorValue)\* Returns the Value as a [`TensorValue`](TensorValue.md#max.graph.TensorValue). Raises an exception if the Value is not a TensorValue. ### `type` {#max.graph.Value.type} > *property* type\*: [Type](type.md#max.graph.type.Type)\* Returns the type of the [`Value`](#max.graph.Value) as a `Type`. --- ## Value semantics Mojo doesn't enforce value semantics or reference semantics. It supports them both and allows each type to define how it is created, copied, and moved (if at all). So, if you're building your own type, you can implement it to support value semantics, reference semantics, or a bit of both. That said, Mojo is designed with argument behaviors that default to value semantics, and it provides tight controls for reference semantics that avoid memory errors. The controls over reference semantics are provided by the [value ownership model](/mojo/manual/values/ownership), but before we get into the syntax and rules for that, it's important that you understand the principles of value semantics. Generally, it means that each variable has unique access to a value, and any code outside the scope of that variable cannot modify its value. ## Intro to value semantics In the most basic situation, sharing a value-semantic type means that you create a copy of the value. This is also known as "pass by value." For example, consider this code: ```mojo def main(): var x = 1 var y = x y += 1 print("x:", x) print("y:", y) ``` ```output x: 1 y: 2 ``` We assigned the value of `x` to `y`, which creates the value for `y` by making a copy of `x`. When we increment `y`, the value of `x` doesn't change. Each variable has exclusive ownership of a value. Whereas, if a type instead uses reference semantics, then `y` would point to the same value as `x`, and incrementing either one would affect the value for both. Neither `x` nor `y` would "own" the value, and any variable would be allowed to reference it and mutate it. Numeric values in Mojo are value semantic because they're trivial types, which are cheap to copy. ## Value semantics in Mojo functions Value semantics also apply to function arguments in Mojo by default. However, the way in which they apply differs depending on whether you define the function using `def` or `fn`. You can also override the default behavior by providing an explicit [argument convention](/mojo/manual/values/ownership#argument-conventions), which is discussed in the [Ownership](/mojo/manual/values/ownership) page. ### Value semantics in `def` functions Here's an example with a function defined using `def`: ```mojo def add_one(y: Int): # def creates an implicit copy of the value because it's mutated y += 1 print("y:", y) def main(): var x = 1 add_one(x) print("x:", x) ``` ```output y: 2 x: 1 ``` Again, the `y` value is a copy and the function cannot modify the original `x` value. 
If you're familiar with Python, this is probably familiar so far, because the code above behaves the same in Python. However, Python is not value semantic. It gets complicated, but let's consider a situation in which you call a Python function and pass an object with a pointer to a heap-allocated value. Python actually gives that function a reference to your object, which allows the function to mutate the heap-allocated value. This can cause nasty bugs if you're not careful, because the function might incorrectly assume it has unique ownership of that object. In Mojo, the default behavior for all function arguments is to use value semantics. If the function wants to modify the value of an incoming argument, then it must explicitly declare so, which avoids accidental mutations of the original value. All Mojo types passed to a `def` function can be treated as mutable, which maintains the expected mutability behavior from Python. But by default, it is mutating a uniquely-owned value, not the original value. For example, when you pass an instance of a `SIMD` vector to a `def` function it creates a unique copy of all values. Thus, if we modify the argument in the function, the original value is unchanged: ```mojo def update_simd(t: SIMD[DType.int32, 4]): t[0] = 9 print("t:", t) def main(): var v = SIMD[DType.int32, 4](1, 2, 3, 4) update_simd(v) print("v:", v) ``` ```output t: [9, 2, 3, 4] v: [1, 2, 3, 4] ``` If this were Python code, the function would modify the original object, because Python shares a reference to the original object. However, not all types are inexpensive to copy. Copying a `String` or `List` requires allocating heap memory, so we want to avoid copying one by accident. When designing a type like this, ideally you want to prevent *implicit* copies, and only make a copy when it's explicitly requested. ### Value semantics in `fn` functions The arguments above are mutable because a function defined with `def` has special treatment for the default [`read` argument convention](/mojo/manual/values/ownership#argument-conventions). In contrast, `fn` functions always receive `read` arguments as immutable references. This is a memory optimization to avoid making unnecessary copies. For example, let's create another function with the `fn` declaration. In this case, the `y` argument is immutable by default, so if the function wants to modify the value in the local scope, it needs to make a local copy: ```mojo fn add_two(y: Int): # y += 2 # This would cause a compiler error because `y` is immutable # We can instead make an explicit copy: var z = y z += 2 print("z:", z) def main(): var x = 1 add_two(x) print("x:", x) ``` ```output z: 3 x: 1 ``` This is all consistent with value semantics because each variable maintains unique ownership of its value. The way the `fn` function receives the `y` value is a "look but don't touch" approach to value semantics. This is also a more memory-efficient approach when dealing with memory-intensive arguments, because Mojo doesn't make any copies unless we explicitly make the copies ourselves. Thus, the default behavior for `def` and `fn` arguments is fully value semantic: arguments are either copies or immutable references, and any living variable from the callee is not affected by the function. But we must also allow reference semantics (mutable references) because it's how we build performant and memory-efficient programs (making copies of everything gets really expensive). 
The challenge is to introduce reference semantics in a way that does not disturb the predictability and safety of value semantics. The way we do that in Mojo is, instead of enforcing that every variable have "exclusive access" to a value, we ensure that every value has an "exclusive owner," and destroy each value when the lifetime of its owner ends. On the next page about [value ownership](/mojo/manual/values/ownership), you'll learn how to modify the default argument conventions, and safely use reference semantics so every value has only one owner at a time. --- ## ValueOrUnknown `struct ValueOrUnknown[dim: Int = -1]` Represents either a static dimension (known at compile time) or a dynamic dimension (known at runtime). ## Parameters * ​dim (`Int`): Optional compile-time dimension value. Default is `UNKNOWN_VALUE` for dynamic dimensions. ## Fields * ​value (`Int`): The runtime value of the dimension. For static dimensions, this is set to the compile-time value. For dynamic dimensions, this is set at runtime. ## Implemented traits `AnyType`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(out self)` Initializes a static dimension with a compile-time value. Note: Fails to compile if dim is `UNKNOWN_VALUE`, as dynamic dimensions require a runtime value. `@implicit` `__init__(out self, v: Int)` Initializes a dynamic dimension with a runtime value. **Args:** * ​v (`Int`): Runtime value for the dimension. --- ## Variables A variable is a name that holds a value or object. All variables in Mojo are mutable—their value can be changed. (If you want to define a constant value that can't change at runtime, see the [`alias` keyword](/mojo/manual/parameters/#alias-named-parameter-expressions).) Mojo has two kinds of variables: * Explicitly-declared variables are created with the `var` keyword, and may include [type annotations](#type-annotations). ```mojo var a = 5 var b: Float64 = 3.14 ``` * Implicitly-declared variables are created with an assignment statement: ```mojo a = 5 b = 3.14 ``` Both types of variables are strongly-typed: the variable receives a type when it's created, and the type never changes. You can't assign a variable a value of a different type: ```mojo count = 8 # count is type Int count = "Nine?" # Error: can't implicitly convert 'StringLiteral' to 'Int' ``` Some types support [*implicit conversions*](#implicit-type-conversion) from other types. For example, an integer value can implicitly convert to a floating-point value: ```mojo var temperature: Float64 = 99 print(temperature) ``` ```output 99.0 ``` In this example, the `temperature` variable is explicitly typed as `Float64`, but assigned an integer value, so the value is implicitly converted to a `Float64`. ## Implicitly-declared variables You can create a variable with just a name and a value. For example: ```mojo name = String("Sam") user_id = 0 ``` Implicitly-declared variables are strongly typed: they take the type from the first value assigned to them. For example, the `user_id` variable above is type `Int`, while the `name` variable is type `String`. You can't assign a string to `user_id` or an integer to `name`. Implicitly-declared variables are scoped at the function level. You create an implicitly-declared variable the first time you assign a value to a given name inside a function. Any subsequent references to that name inside the function refer to the same variable.
For more information, see [Variable scopes](#variable-scopes), which describes how variable scoping differs between explicitly- and implicitly-declared variables.

## Explicitly-declared variables

You can declare a variable with the `var` keyword. For example:

```mojo
var name = String("Sam")
var user_id: Int
```

The `name` variable is initialized to the string "Sam". The `user_id` variable is uninitialized, but it has a declared type, `Int`, for an integer value. All explicitly-declared variables are typed—either explicitly with a [type annotation](#type-annotations) or implicitly when they're initialized with a value.

Since variables are strongly typed, you can't assign a variable a value of a different type, unless those types can be [implicitly converted](#implicit-type-conversion). For example, this code will not compile:

```mojo
var user_id: Int = "Sam"
```

There are two main differences between explicitly-declared variables and implicitly-declared variables:

* An explicitly-declared variable can be declared without initializing it:

  ```mojo
  var value: Float64
  ```

* Explicitly-declared variables follow [lexical scoping](#variable-scopes), unlike implicitly-declared variables.

## Type annotations

Although Mojo can infer a variable type from the first value assigned to a variable, it also supports static type annotations on variables. Type annotations provide a more explicit way of specifying the variable's type.

To specify the type for a variable, add a colon followed by the type name:

```mojo
var name: String = get_name()
```

This makes it clear that `name` is type `String`, without knowing what the `get_name()` function returns. The `get_name()` function may return a `String`, or a value that's implicitly convertible to a `String`.

:::note
You must declare a variable with `var` to use type annotations.
:::

If a type has a constructor with just one argument, you can initialize it in two ways:

```mojo
var name1: String = "Sam"
var name2 = String("Sam")
```

Both of these lines invoke the same constructor to create a `String` from a `StringLiteral`.

### Late initialization

Using type annotations allows for late initialization. For example, notice here that the `z` variable is first declared with just a type, and the value is assigned later:

```mojo
fn my_function(x: Int):
    var z: Float32
    if x != 0:
        z = 1.0
    else:
        z = foo()
    print(z)

fn foo() -> Float32:
    return 3.14
```

If you try to pass an uninitialized variable to a function or use it on the right-hand side of an assignment statement, compilation fails.

```mojo
var z: Float32
var y = z # Error: use of uninitialized value 'z'
```

:::note
Late initialization works only if the variable is declared with a type.
:::

### Implicit type conversion

Some types include built-in type conversion (type casting) from another type into their own type. For example, if you assign an integer to a variable that has a floating-point type, it converts the value instead of giving a compiler error:

```mojo
var number: Float64 = Int(1)
print(number)
```

```output
1.0
```

As shown above, value assignment can be converted into a constructor call if the target type has a constructor that meets the following criteria:

- It's decorated with the `@implicit` decorator.
- It takes a single required argument that matches the value being assigned.

So, this code uses the `Float64` constructor that takes an integer: `__init__(out self, value: Int)`.

In general, implicit conversions should only be supported where the conversion is lossless.
Implicit conversion follows the logic of [overloaded functions](/mojo/manual/functions#overloaded-functions). If the destination type has a viable implicit conversion constructor for the source type, it can be invoked for implicit conversion.

So assigning an integer to a `Float64` variable is exactly the same as this:

```mojo
var number = Float64(1)
```

Similarly, if you call a function that requires an argument of a certain type (such as `Float64`), you can pass in any value as long as that value type can implicitly convert to the required type (using one of the type's overloaded constructors).

For example, you can pass an `Int` to a function that expects a `Float64`, because `Float64` includes an implicit conversion constructor that takes an `Int`:

```mojo
fn take_float(value: Float64):
    print(value)

fn pass_integer():
    var value: Int = 1
    take_float(value)
```

For more details on implicit conversion, see [Constructors and implicit conversion](/mojo/manual/lifecycle/life/#constructors-and-implicit-conversion).

## Variable scopes

Variables declared with `var` are bound by *lexical scoping*. This means that nested code blocks can read and modify variables defined in an outer scope. But an outer scope **cannot** read variables defined in an inner scope at all.

For example, the `if` code block shown here creates an inner scope where outer variables are accessible to read/write, but any new variables do not live beyond the scope of the `if` block:

```mojo
def lexical_scopes():
    var num = 1
    var dig = 1
    if num == 1:
        print("num:", num)  # Reads the outer-scope "num"
        var num = 2         # Creates new inner-scope "num"
        print("num:", num)  # Reads the inner-scope "num"
        dig = 2             # Updates the outer-scope "dig"
    print("num:", num)  # Reads the outer-scope "num"
    print("dig:", dig)  # Reads the outer-scope "dig"

lexical_scopes()
```

```output
num: 1
num: 2
num: 1
dig: 2
```

Note that the `var` statement inside the `if` creates a **new** variable with the same name as the outer variable. This prevents the code inside the `if` block from accessing the outer `num` variable. (This is called "variable shadowing," where the inner scope variable hides or "shadows" a variable from an outer scope.)

The lifetime of the inner `num` ends exactly where the `if` code block ends, because that's the scope in which the variable was defined.

This is in contrast to implicitly-declared variables (those without the `var` keyword), which use **function-level scoping** (consistent with Python variable behavior). That means, when you change the value of an implicitly-declared variable inside the `if` block, it actually changes the value for the entire function.

For example, here's the same code but *without* the `var` declarations:

```mojo
def function_scopes():
    num = 1
    if num == 1:
        print(num)  # Reads the function-scope "num"
        num = 2     # Updates the function-scope variable
        print(num)  # Reads the function-scope "num"
    print(num)  # Reads the function-scope "num"

function_scopes()
```

```output
1
2
2
```

Now, the last `print()` function sees the updated `num` value from the inner scope, because implicitly-declared variables (Python-style variables) use function-level scope (instead of lexical scope).

---

## VariadicList

`@register_passable(trivial)`

`struct VariadicList[type: AnyTrivialRegType]`

A utility class to access variadic function arguments. Provides a "list" view of the function argument so that the size of the argument list and each individual argument can be accessed.

## Parameters

* ​type (`AnyTrivialRegType`): The type of the elements in the list.
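To illustrate the "list" view described above, here's a minimal sketch (the function name is illustrative) of a variadic function whose arguments arrive as a `VariadicList`:

```mojo
fn sum_args(*values: Int) -> Int:
    # Inside the body, `values` is a VariadicList[Int]:
    # it has a length and supports indexing and iteration.
    var total = 0
    for value in values:
        total += value
    return total

def main():
    print(sum_args(1, 2, 3))  # => 6
```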
## Fields

* ​value (`Variadic[type]`): The underlying storage for the variadic list.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Aliases

### `IterType`

`alias IterType = _VariadicListIter[type]`

## Methods

### `__init__`

`@implicit`

`__init__(*value: type) -> Self`

Constructs a VariadicList from a variadic list of arguments.

**Args:**

* ​\*value (`type`): The variadic argument list to construct the variadic list with.

### `__getitem__`

`__getitem__[I: Indexer](self, idx: I) -> type`

Gets a single element on the variadic list.

**Parameters:**

* ​I (`Indexer`): A type that can be used as an index.

**Args:**

* ​idx (`I`): The index of the element to access on the list.

**Returns:**

The element on the list corresponding to the given index.

### `__len__`

`__len__(self) -> Int`

Gets the size of the list.

**Returns:**

The number of elements on the variadic list.

### `__iter__`

`__iter__(self) -> _VariadicListIter[type]`

Iterate over the list.

**Returns:**

An iterator to the start of the list.

---

## VariadicListMem

`struct VariadicListMem[elt_is_mutable: Bool, //, element_type: AnyType, origin: Origin[elt_is_mutable], is_owned: Bool]`

A utility class to access variadic function arguments of memory-only types that may have ownership. It exposes references to the elements in a way that can be enumerated. Each element may be accessed with `elt[]`.

## Parameters

* ​elt\_is\_mutable (`Bool`): True if the elements of the list are mutable for a `mut` or `owned` argument.
* ​element\_type (`AnyType`): The type of the elements in the list.
* ​origin (`Origin[elt_is_mutable]`): The origin of the underlying elements.
* ​is\_owned (`Bool`): Whether the elements are owned by the list.

## Fields

* ​value (`Variadic[ref [origin._mlir_origin] element_type]`): The underlying storage, a variadic list of references to elements of the given type.

## Implemented traits

`AnyType`, `Sized`, `UnknownDestructibility`

## Aliases

### `reference_type`

`alias reference_type = Pointer[element_type, origin]`

## Methods

### `__moveinit__`

`__moveinit__(out self, owned existing: Self)`

Move constructor.

**Args:**

* ​existing (`Self`): The existing VariadicListMem.

### `__del__`

`__del__(owned self)`

Destructor that releases elements if owned.

### `__getitem__`

`__getitem__(self, idx: Int) -> ref [origin, *[0,0]] element_type`

Gets a single element on the variadic list.

**Args:**

* ​idx (`Int`): The index of the element to access on the list.

**Returns:**

A reference to the element on the list corresponding to the given index.

### `__len__`

`__len__(self) -> Int`

Gets the size of the list.

**Returns:**

The number of elements on the variadic list.

### `__iter__`

`__iter__(self, out result: _VariadicListMemIter[element_type, origin, self, is_owned])`

Iterate over the list.

**Returns:**

An iterator to the start of the list.

---

## VariadicPack

`@register_passable`

`struct VariadicPack[elt_is_mutable: Bool, //, is_owned: Bool, origin: Origin[elt_is_mutable], element_trait: AnyTrait[AnyType], *element_types: element_trait]`

A utility class to access variadic pack arguments and provide an API for doing things with them.

## Parameters

* ​elt\_is\_mutable (`Bool`): True if the elements of the list are mutable for a `mut` or `owned` argument pack.
* ​is\_owned (`Bool`): Whether the elements are owned by the pack. If so, the pack will release the elements when it is destroyed.
* ​origin (`Origin[elt_is_mutable]`): The origin of the underlying elements.
* ​element\_trait (`AnyTrait[AnyType]`): The trait that each element of the pack conforms to.
* ​\*element\_types (`element_trait`): The list of types held by the argument pack.

## Implemented traits

`AnyType`, `Sized`, `UnknownDestructibility`

## Methods

### `__del__`

`__del__(owned self)`

Destructor that releases elements if owned.

### `__getitem__`

`__getitem__[index: Int](self) -> ref [origin] element_types[index.value]`

Return a reference to an element of the pack.

**Parameters:**

* ​index (`Int`): The element of the pack to return.

**Returns:**

A reference to the element. The reference's mutability follows the mutability of the pack argument convention.

### `__len__`

`static __len__() -> Int`

Return the VariadicPack length.

**Returns:**

The number of elements in the variadic pack.

`__len__(self) -> Int`

Return the VariadicPack length.

**Returns:**

The number of elements in the variadic pack.

### `each`

`each[func: fn[element_trait]($0) capturing -> None](self)`

Apply a function to each element of the pack in order. This applies the specified function (which must be parametric on the element type) to each element of the pack, from the first element to the last, passing in each element as a read-only argument.

**Parameters:**

* ​func (`fn[element_trait]($0) capturing -> None`): The function to apply to each element.

### `each_idx`

`each_idx[func: fn[Int, element_trait]($1) capturing -> None](self)`

Apply a function to each element of the pack in order. This applies the specified function (which must be parametric on the element's index and type) to each element of the pack, from the first element to the last, passing in each element as a read-only argument.

**Parameters:**

* ​func (`fn[Int, element_trait]($1) capturing -> None`): The function to apply to each element.

---

## variadics

Implements the VariadicList and VariadicPack types.

These are Mojo built-ins, so you don't need to import them.

## Structs

* [​`VariadicList`](/mojo/stdlib/builtin/variadics/VariadicList): A utility class to access variadic function arguments. Provides a "list" view of the function argument so that the size of the argument list and each individual argument can be accessed.
* [​`VariadicListMem`](/mojo/stdlib/builtin/variadics/VariadicListMem): A utility class to access variadic function arguments of memory-only types that may have ownership. It exposes references to the elements in a way that can be enumerated. Each element may be accessed with `elt[]`.
* [​`VariadicPack`](/mojo/stdlib/builtin/variadics/VariadicPack): A utility class to access variadic pack arguments and provide an API for doing things with them.

---

## VariadicTensors

`@register_passable(trivial)`

`struct VariadicTensors[mut: Bool, input: IO, //, type: DType, rank: Int, size: Int, io_spec: IOSpec[mut, input], *, static_specs: StaticTuple[StaticTensorSpec[type, rank], size]]`

A tuple-like container of tensors representing variadic arguments from the graph compiler.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Sized`, `UnknownDestructibility`

## Methods

### `__getitem__`

`__getitem__[index: Int](self) -> ManagedTensorSlice[io_spec, static_spec=static_specs.__getitem__[::Indexer](index)]`

Returns the tensor at the given position in the variadic argument pack.

**Parameters:**

* ​index (`Int`): The index into the variadic tensor arguments.

**Returns:**

The tensor at the specified index.

### `__len__`

`__len__(self) -> Int`

Returns the number of variadic arguments in the pack.
**Returns:**

The number of variadic arguments.

---

## variance

`variance(src: NDBuffer[type, 1, origin], mean_value: SIMD[type, 1], correction: Int = 1) -> SIMD[type, 1]`

Given a mean, computes the variance of elements in a buffer. The mean value is used to avoid a second pass over the data:

```
variance(x) = sum((x - E(x))^2) / (size - correction)
```

**Args:**

* ​src (`NDBuffer[type, 1, origin]`): The buffer.
* ​mean\_value (`SIMD[type, 1]`): The mean value of the buffer.
* ​correction (`Int`): Normalize variance by `size - correction` (default 1).

**Returns:**

The variance value of the elements in a buffer.

`variance(src: NDBuffer[type, 1, origin], correction: Int = 1) -> SIMD[type, 1]`

Computes the variance value of the elements in a buffer.

```
variance(x) = sum((x - E(x))^2) / (size - correction)
```

**Args:**

* ​src (`NDBuffer[type, 1, origin]`): The buffer.
* ​correction (`Int`): Normalize variance by `size - correction` (default 1).

**Returns:**

The variance value of the elements in a buffer.

---

## variant

Defines a Variant type.

You can use this type to implement variant/sum types. For example:

```mojo
from utils import Variant

alias IntOrString = Variant[Int, String]

fn to_string(mut x: IntOrString) -> String:
    if x.isa[String]():
        return x[String]
    # x.isa[Int]()
    return String(x[Int])

# They have to be mutable for now, and implement Copyable & Movable
var an_int = IntOrString(4)
var a_string = IntOrString(String("I'm a string!"))
var who_knows = IntOrString(0)
import random
if random.random_ui64(0, 1):
    who_knows.set[String]("I'm actually a string too!")

print(to_string(an_int))
print(to_string(a_string))
print(to_string(who_knows))
```

## Structs

* [​`Variant`](/mojo/stdlib/utils/variant/Variant): A runtime-variant type.

---

## Variant

`struct Variant[*Ts: Copyable & Movable]`

A runtime-variant type.

Data for this type is stored internally. Currently, its size is the largest size of any of its variants plus a 16-bit discriminant.

You can:

* use `isa[T]()` to check what type a variant is
* use `unsafe_take[T]()` to take a value from the variant
* use `[T]` to get a value out of a variant
  * This currently does an extra copy/move until we have origins
  * It also temporarily requires the value to be mutable
* use `set[T](owned new_value: T)` to reset the variant to a new value
* use `is_type_supported[T]` to check if the variant permits the type `T`

Example:

```mojo
from utils import Variant

alias IntOrString = Variant[Int, String]

fn to_string(mut x: IntOrString) -> String:
    if x.isa[String]():
        return x[String]
    # x.isa[Int]()
    return String(x[Int])

# They have to be mutable for now, and implement Copyable & Movable
var an_int = IntOrString(4)
var a_string = IntOrString(String("I'm a string!"))
var who_knows = IntOrString(0)
import random
if random.random_ui64(0, 1):
    who_knows.set[String]("I'm actually a string too!")

print(to_string(an_int))
print(to_string(a_string))
print(to_string(who_knows))
```

## Parameters

* ​\*Ts (`Copyable & Movable`): The elements of the variadic.

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`__init__(out self, *, unsafe_uninitialized: Tuple[])`

Unsafely create an uninitialized Variant.

**Args:**

* ​unsafe\_uninitialized (`Tuple[]`): Marker argument indicating this initializer is unsafe.

`@implicit`

`__init__[T: Copyable & Movable](out self, owned value: T)`

Create a variant with one of the types.

**Parameters:**

* ​T (`Copyable & Movable`): The type to initialize the variant to.
Generally this can be inferred from the argument type, e.g. `Variant[Int, String](4)`.

**Args:**

* ​value (`T`): The value to initialize the variant with.

### `__copyinit__`

`__copyinit__(out self, other: Self)`

Creates a deep copy of an existing variant.

**Args:**

* ​other (`Self`): The variant to copy from.

### `__moveinit__`

`__moveinit__(out self, owned other: Self)`

Move initializer for the variant.

**Args:**

* ​other (`Self`): The variant to move.

### `__del__`

`__del__(owned self)`

Destroy the variant.

### `__getitem__`

`__getitem__[T: Copyable & Movable](ref self) -> ref [self] T`

Get the value out of the variant as a type-checked type.

This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort!

For now this has the limitation that it requires the variant value to be mutable.

**Parameters:**

* ​T (`Copyable & Movable`): The type of the value to get out.

**Returns:**

A reference to the internal data.

### `copy`

`copy(self, out copy: Self)`

Explicitly creates a deep copy of an existing variant.

**Returns:**

A copy of the value.

### `take`

`take[T: Copyable & Movable](mut self) -> T`

Take the current value of the variant with the provided type. The caller takes ownership of the underlying value.

This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort!

**Parameters:**

* ​T (`Copyable & Movable`): The type to take out.

**Returns:**

The underlying data to be taken out as an owned value.

### `unsafe_take`

`unsafe_take[T: Copyable & Movable](mut self) -> T`

Unsafely take the current value of the variant with the provided type. The caller takes ownership of the underlying value.

This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data.

**Parameters:**

* ​T (`Copyable & Movable`): The type to take out.

**Returns:**

The underlying data to be taken out as an owned value.

### `replace`

`replace[Tin: Copyable & Movable, Tout: Copyable & Movable](mut self, owned value: Tin) -> Tout`

Replace the current value of the variant with the provided type. The caller takes ownership of the underlying value.

This explicitly checks that your value is of that type! If you haven't verified the type correctness at runtime, the program will abort!

**Parameters:**

* ​Tin (`Copyable & Movable`): The type to put in.
* ​Tout (`Copyable & Movable`): The type to take out.

**Args:**

* ​value (`Tin`): The value to put in.

**Returns:**

The underlying data to be taken out as an owned value.

### `unsafe_replace`

`unsafe_replace[Tin: Copyable & Movable, Tout: Copyable & Movable](mut self, owned value: Tin) -> Tout`

Unsafely replace the current value of the variant with the provided type. The caller takes ownership of the underlying value.

This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data.

**Parameters:**

* ​Tin (`Copyable & Movable`): The type to put in.
* ​Tout (`Copyable & Movable`): The type to take out.

**Args:**

* ​value (`Tin`): The value to put in.

**Returns:**

The underlying data to be taken out as an owned value.

### `set`

`set[T: Copyable & Movable](mut self, owned value: T)`

Set the variant value.
This will call the destructor on the old value, and update the variant's internal type and data to the new value.

**Parameters:**

* ​T (`Copyable & Movable`): The new variant type. Must be one of the Variant's type arguments.

**Args:**

* ​value (`T`): The new value to set the variant to.

### `isa`

`isa[T: Copyable & Movable](self) -> Bool`

Check if the variant contains the required type.

**Parameters:**

* ​T (`Copyable & Movable`): The type to check.

**Returns:**

True if the variant contains the requested type.

### `unsafe_get`

`unsafe_get[T: Copyable & Movable](ref self) -> ref [self] T`

Get the value out of the variant as a type-checked type.

This doesn't explicitly check that your value is of that type! If you haven't verified the type correctness at runtime, you'll get a type that *looks* like your type, but has potentially unsafe and garbage member data.

For now this has the limitation that it requires the variant value to be mutable.

**Parameters:**

* ​T (`Copyable & Movable`): The type of the value to get out.

**Returns:**

A reference to the internal data.

### `is_type_supported`

`static is_type_supported[T: Copyable & Movable]() -> Bool`

Check if a type can be used by the `Variant`. Example:

```mojo
from utils import Variant

def takes_variant(mut arg: Variant):
    if arg.is_type_supported[Float64]():
        arg = Float64(1.5)

def main():
    var x = Variant[Int, Float64](1)
    takes_variant(x)
    if x.isa[Float64]():
        print(x[Float64])  # 1.5
```

For example, the `Variant[Int, Bool]` permits `Int` and `Bool`.

**Parameters:**

* ​T (`Copyable & Movable`): The type of the value to check support for.

**Returns:**

`True` if type `T` is supported by the `Variant`.

---

## vec_int__

`vec_int__(gpr: Int)`

Horizontal ui16 multiply `z0[i] += x0[i] + y0[i]`.

---

## vecfp

`vecfp(gpr: Int)`

Horizontal float16 multiply `z0[i] += x0[i] + y0[i]`.

---

## vectorize

`vectorize[origins: origin.set, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, unroll_factor: Int = 1](size: Int)`

Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in separate iterations.

The below example demonstrates how you could improve the performance of a loop by setting multiple values at the same time using SIMD registers on the machine:

```mojo
from algorithm.functional import vectorize
from memory import UnsafePointer
from sys import simdwidthof

# The amount of elements to loop through
alias size = 10
# How many DType.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # @parameter allows the closure to capture the `p` pointer
    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[closure, simd_width](size)
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
```

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration.
The remainder of 10 % 4 is 2, so those last two elements will be set in two separate iterations:

```plaintext
storing 4 els at pos 0
storing 4 els at pos 4
storing 1 els at pos 8
storing 1 els at pos 9
[0, 0, 0, 0, 4, 4, 4, 4, 8, 9]
```

You can also unroll the loop to potentially improve performance at the cost of binary size:

```
vectorize[closure, width, unroll_factor=2](size)
```

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in pseudocode:

```
closure[4](0)
closure[4](4)
# Remainder loop won't unroll unless `size` is passed as a parameter
for i in range(8, 10):
    closure[1](i)
    closure[1](i)
```

You can pass `size` as a parameter if it's known at compile time to reduce the iterations for the remainder. This only occurs if the remainder is a power of 2 (2, 4, 8, 16, ...). The remainder loop will still unroll for performance improvements if not a power of 2.

**Parameters:**

* ​origins (`origin.set`): The capture origins.
* ​func (`fn[Int](Int) capturing -> None`): The function that will be called in the loop body.
* ​simd\_width (`Int`): The SIMD vector width.
* ​unroll\_factor (`Int`): The unroll factor for the main loop (default 1).

**Args:**

* ​size (`Int`): The upper limit for the loop.

`vectorize[origins: origin.set, //, func: fn[Int](Int) capturing -> None, simd_width: Int, /, *, size: Int, unroll_factor: Int = size if is_nvidia_gpu() else 1]()`

Simplifies SIMD optimized loops by mapping a function across a range from 0 to `size`, incrementing by `simd_width` at each step. The remainder of `size % simd_width` will run in a single iteration if it's a power of 2.

The below example demonstrates how you could improve the performance of a loop by setting multiple values at the same time using SIMD registers on the machine:

```mojo
from algorithm.functional import vectorize
from memory import UnsafePointer
from sys import simdwidthof

# The amount of elements to loop through
alias size = 10
# How many DType.int32 elements fit into the SIMD register (4 on 128bit)
alias simd_width = simdwidthof[DType.int32]()  # assumed to be 4 in this example

fn main():
    var p = UnsafePointer[Int32].alloc(size)

    # @parameter allows the closure to capture the `p` pointer
    @parameter
    fn closure[width: Int](i: Int):
        print("storing", width, "els at pos", i)
        p.store[width=width](i, i)

    vectorize[closure, simd_width, size=size]()
    print(p.load[width=simd_width]())
    print(p.load[width=simd_width](simd_width))
```

On a machine with a SIMD register size of 128, this will set 4xInt32 values on each iteration. The remainder of 10 % 4 is 2, so those last two elements will be set in a single iteration:

```plaintext
storing 4 els at pos 0
storing 4 els at pos 4
storing 2 els at pos 8
[0, 0, 0, 0, 4, 4, 4, 4, 8, 8]
```

If the remainder is not a power of 2 (2, 4, 8, 16, ...) there will be a separate iteration for each element. However, passing `size` as a parameter also allows the loop for the remaining elements to be unrolled.

You can also unroll the main loop to potentially improve performance at the cost of binary size:

```
vectorize[closure, width, size=size, unroll_factor=2]()
```

In the generated assembly the function calls will be repeated, resulting in fewer arithmetic, comparison, and conditional jump operations. The assembly would look like this in pseudocode:

```
closure[4](0)
closure[4](4)
closure[2](8)
```

**Parameters:**

* ​origins (`origin.set`): The capture origins.
* ​func (`fn[Int](Int) capturing -> None`): The function that will be called in the loop body.
* ​simd\_width (`Int`): The SIMD vector width.
* ​size (`Int`): The upper limit for the loop.
* ​unroll\_factor (`Int`): The unroll factor for the main loop (defaults to `size` on NVIDIA GPUs, otherwise 1).

---

## Vendor

`@register_passable`

`struct Vendor`

Represents GPU vendors.

This struct provides identifiers for different GPU vendors and utility methods for comparison and string representation.

The Vendor struct defines constants for common GPU vendors (NVIDIA, AMD) and includes a NO\_GPU option for systems without GPU support. It provides comparison operators and string conversion methods for vendor identification.

## Implemented traits

`AnyType`, `UnknownDestructibility`, `Writable`

## Aliases

### `AMD_GPU`

`alias AMD_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](1))`

Represents AMD GPU vendor.

### `NO_GPU`

`alias NO_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](0))`

Represents no GPU or CPU-only execution.

### `NVIDIA_GPU`

`alias NVIDIA_GPU = Vendor(__init__[__mlir_type.!pop.int_literal](2))`

Represents NVIDIA GPU vendor.

## Methods

### `__eq__`

`__eq__(self, other: Self) -> Bool`

Checks if two `Vendor` instances are equal.

**Args:**

* ​other (`Self`): The `Vendor` to compare with.

**Returns:**

True if vendors are equal, False otherwise.

### `__ne__`

`__ne__(self, other: Self) -> Bool`

Checks if two `Vendor` instances are not equal.

**Args:**

* ​other (`Self`): The `Vendor` to compare with.

**Returns:**

True if vendors are not equal, False otherwise.

### `__is__`

`__is__(self, other: Self) -> Bool`

Identity comparison for vendors.

**Args:**

* ​other (`Self`): The `Vendor` to compare with.

**Returns:**

True if vendors are identical, False otherwise.

### `__isnot__`

`__isnot__(self, other: Self) -> Bool`

Negative identity comparison for vendors.

**Args:**

* ​other (`Self`): The Vendor to compare with.

**Returns:**

True if vendors are not identical, False otherwise.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Writes vendor information to a writer.

**Parameters:**

* ​W (`Writer`): The type of writer to use for output. Must implement the Writer trait.

**Args:**

* ​writer (`W`): The writer to output vendor information to.

### `__str__`

`__str__(self) -> String`

Returns a string representation of the vendor.

**Returns:**

String representation of the vendor.

---

## vendor_blas

## Structs

* [​`Backend`](./Backend):
* [​`Handle`](./Handle):

## Functions

* [​`matmul`](./matmul): Matmul using the vendor BLAS library, with a global handle.

---

## vnni_intrinsics

## Functions

* [​`dot_i16_to_i32_AVX2`](./dot_i16_to_i32_AVX2): The dot product of the two words in each int32 element of a and b plus an int32 from src.
* [​`dot_i16_to_i32_x86`](./dot_i16_to_i32_x86): The dot product of the two words in each int32 element of a and b plus an int32 from src using VNNI or AVX2.
* [​`dot_i8_to_i32_AVX2`](./dot_i8_to_i32_AVX2): The dot product of the four bytes in each int32 element of a and b plus an int32 from src.
* [​`dot_i8_to_i32_saturated_AVX2`](./dot_i8_to_i32_saturated_AVX2): The dot product of the four bytes in each int32 element of a and b plus an int32 from src.
* [​`dot_i8_to_i32_saturated_x86`](./dot_i8_to_i32_saturated_x86): The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2.
* [​`dot_i8_to_i32_x86`](./dot_i8_to_i32_x86): The dot product of the four bytes in each int32 element of a and b plus an int32 from src using VNNI or AVX2.
* [​`pmaddubs`](./pmaddubs): * [​`pmaddw`](./pmaddw): * [​`vpdpbusd`](./vpdpbusd): * [​`vpdpbusds`](./vpdpbusds): * [​`vpdpwssd`](./vpdpwssd): * [​`vpdpwssds`](./vpdpwssds): --- ## vpdpbusd `vpdpbusd[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpbusds `vpdpbusds[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpwssd `vpdpwssd[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## vpdpwssds `vpdpwssds[width: Int, a_type: DType, b_type: DType, c_type: DType](src: SIMD[c_type, width], a: SIMD[a_type, width], b: SIMD[b_type, width]) -> SIMD[c_type, width]` --- ## wait_on_dependent_grids `wait_on_dependent_grids()` Waits for all dependent grids launched by this grid to complete execution. This function blocks the calling grid until all dependent grids that were launched by this grid have completed their execution. It provides a synchronization point between parent and child grids in a multi-grid dependency chain. Note: * Only supported on NVIDIA SM90+ (Hopper architecture and newer) GPUs. * Must be called by all threads in a thread block to avoid undefined behavior. * Can be used to ensure dependent grid work is complete before proceeding with subsequent operations in the parent grid. --- ## warp GPU warp-level operations and utilities. This module provides warp-level operations for NVIDIA and AMD GPUs, including: * Shuffle operations to exchange values between threads in a warp: * shuffle\_idx: Copy value from source lane to other lanes * shuffle\_up: Copy from lower lane IDs * shuffle\_down: Copy from higher lane IDs * shuffle\_xor: Exchange values in butterfly pattern * Warp-wide reductions: * sum: Compute sum across warp * max: Find maximum value across warp * min: Find minimum value across warp * broadcast: Broadcast value to all lanes The module handles both NVIDIA and AMD GPU architectures through architecture-specific implementations of the core operations. It supports various data types including integers, floats, and half-precision floats, with SIMD vectorization. ## Structs * [​`ReductionMethod`](/mojo/stdlib/gpu/warp/ReductionMethod): Enumerates the supported reduction methods. ## Functions * [​`broadcast`](/mojo/stdlib/gpu/warp/broadcast): Broadcasts a SIMD value from lane 0 to all lanes in the warp. * [​`lane_group_max`](/mojo/stdlib/gpu/warp/lane_group_max): Reduces a SIMD value to its maximum within a lane group using warp-level operations. * [​`lane_group_max_and_broadcast`](/mojo/stdlib/gpu/warp/lane_group_max_and_broadcast): Reduces and broadcasts the maximum value within a lane group using warp-level operations. * [​`lane_group_min`](/mojo/stdlib/gpu/warp/lane_group_min): Reduces a SIMD value to its minimum within a lane group using warp-level operations. * [​`lane_group_reduce`](/mojo/stdlib/gpu/warp/lane_group_reduce): Performs a generic warp-level reduction operation using shuffle operations. * [​`lane_group_sum`](/mojo/stdlib/gpu/warp/lane_group_sum): Computes the sum of values across a group of lanes using warp-level operations. * [​`lane_group_sum_and_broadcast`](/mojo/stdlib/gpu/warp/lane_group_sum_and_broadcast): Computes the sum across a lane group and broadcasts the result to all lanes. 
* [​`max`](/mojo/stdlib/gpu/warp/max): Computes the maximum value across all lanes in a warp. * [​`min`](/mojo/stdlib/gpu/warp/min): Computes the minimum value across all lanes in a warp. * [​`prefix_sum`](/mojo/stdlib/gpu/warp/prefix_sum): Computes a warp-level prefix sum (scan) operation. * [​`reduce`](/mojo/stdlib/gpu/warp/reduce): Performs a generic warp-wide reduction operation using shuffle operations. * [​`shuffle_down`](/mojo/stdlib/gpu/warp/shuffle_down): Copies values from threads with higher lane IDs in the warp. * [​`shuffle_idx`](/mojo/stdlib/gpu/warp/shuffle_idx): Copies a value from a source lane to other lanes in a warp. * [​`shuffle_up`](/mojo/stdlib/gpu/warp/shuffle_up): Copies values from threads with lower lane IDs in the warp. * [​`shuffle_xor`](/mojo/stdlib/gpu/warp/shuffle_xor): Exchanges values between threads in a warp using a butterfly pattern. * [​`sum`](/mojo/stdlib/gpu/warp/sum): Computes the sum of values across all lanes in a warp. --- ## Warp In GPU programming, a warp is a subset of [threads](thread.mdx) from a [thread block](thread-block.mdx) that execute together in lockstep. When a GPU assigns a thread block to execute on a [streaming multiprocessor](streaming-multiprocessor.mdx) (SM), the SM divides the thread block into warps of 32 or 64 threads, with the exact size depending on the GPU architecture. If a thread block contains a number of threads not evenly divisible by the warp size, the SM creates a partially filled final warp that still consumes the full warp's resources. For example, if a thread block has 100 threads and the warp size is 32, the SM creates: - 3 full warps of 32 threads each (96 threads total) - 1 partial warp with only 4 active threads but still occupying a full warp's worth of resources (32 thread slots) The SM effectively disables the unused thread slots in partial warps, but these slots still consume hardware resources. For this reason, developers generally should make thread block sizes a multiple of the warp size to optimize resource usage. Each thread in a warp executes the same instruction at the same time on different data, following the single instruction, multiple threads (SIMT) execution model. If threads within a warp take different execution paths (called *warp divergence*), the warp serially executes each branch path taken, disabling threads that are not on that path. This means that optimal performance is achieved when all threads in a warp follow the same execution path. An SM can actively manage multiple warps from different thread blocks simultaneously, helping keep execution units busy. For example, the warp scheduler can quickly switch to another ready warp if the current warp's threads must wait for memory access. Warps deliver several key performance advantages: - The hardware needs to manage only warps instead of individual threads, reducing scheduling overhead - Threads in a warp can access contiguous memory locations efficiently through memory coalescing - The hardware automatically synchronizes threads within a warp, eliminating the need for explicit synchronization - The warp scheduler can hide memory latency by switching between warps, maximizing compute resource utilization --- ## warp_id `warp_id() -> UInt` Returns the warp ID of the current thread within its block. The warp ID is a unique identifier for each warp within a block, ranging from 0 to BLOCK\_SIZE/WARP\_SIZE-1. This ID is commonly used for warp-level programming and synchronization within a block. 
**Returns:**

The warp ID (0 to BLOCK\_SIZE/WARP\_SIZE-1) of the current thread.

---

## warp_specialize_gemm_with_multicasting

`warp_specialize_gemm_with_multicasting[c_type: DType, c_shape: DimList, a_type: DType, a_shape: DimList, b_type: DType, b_shape: DimList, //, *, transpose_b: Bool, wgmma_shape: IndexList[3], config: MatmulConfig[a_type, b_type, c_type, transpose_b, wgmma_shape], grid_shape: OptionalReg[IndexList[2]] = OptionalReg[IndexList[2]]({:i1 0, 1}), use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1}), schedule: MatmulSchedule = MatmulSchedule(__init__[__mlir_type.!pop.int_literal](-1))](c_device: NDBuffer[c_type, 2, origin, c_shape], a_device: NDBuffer[a_type, 2, origin, a_shape], b_device: NDBuffer[b_type, 2, origin, b_shape], M: Int, N: Int, K: Int, ctx: DeviceContext)`

---

## warp_specialized_gemm_output

`warp_specialized_gemm_output[c_type: DType, accum_type: DType, c_layout: Layout, c_smem_layout: Layout, c_tma_layout: Layout, c_reg_layout: Layout, c_desc_layout: Layout, /, *, c_tile_shape: IndexList[2], c_swizzle: TensorMapSwizzle, wgmma_shape: IndexList[3], num_consumer: Int = 1, use_tma_store: Bool = False, elementwise_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> None]({:i1 0, 1}), elementwise_compute_lambda_fn: OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]] = OptionalReg[fn[DType, Int, Int](IndexList[2], SIMD[$0, $1]) capturing -> SIMD[$0, $1]]({:i1 0, 1})](c_tma_op: TMATensorTile[c_type, c_tma_layout, c_desc_layout], c: LayoutTensor[c_type, c_layout, MutableAnyOrigin, address_space=address_space, element_layout=element_layout, layout_int_type=layout_int_type, linear_idx_type=linear_idx_type, masked=masked, alignment=alignment], c_smem_tile: LayoutTensor[c_type, c_smem_layout, MutableAnyOrigin, address_space=AddressSpace(3), alignment=128], c_reg_tile: LayoutTensor[accum_type, c_reg_layout, MutableAnyOrigin, address_space=AddressSpace(5)], warp_group_thread_idx: UInt, local_warp_group_idx: UInt, local_thread_idx: UInt, block_y: Int, block_x: Int)`

---

## warpgroup_reg_alloc

`warpgroup_reg_alloc[count: Int]()`

Allocates additional registers for the executing warp group. Hints to the system to increase per-thread registers owned by the executing warp. Requests additional registers to increase the absolute per-thread maximum register count from its current value to the specified count.

Note:

* Only supported on NVIDIA SM90+ GPUs.
* Performance optimization hint that may be ignored by the hardware.
* Pair with `warpgroup_reg_dealloc()` when extra registers are no longer needed.

**Parameters:**

* ​count (`Int`): The desired number of registers per thread. Must be:
  * A multiple of 8.
  * Between 24 and 256 (inclusive).

---

## warpgroup_reg_dealloc

`warpgroup_reg_dealloc[count: Int]()`

Deallocates additional registers for the executing warp group. Hints to the system to decrease per-thread registers owned by the executing warp.
Releases extra registers to reduce the absolute per-thread maximum register count from its current value to the specified count.

Note:

* Only supported on NVIDIA SM90+ GPUs.
* Performance optimization hint that may be ignored by the hardware.
* Pair with `warpgroup_reg_alloc()` when extra registers are needed.

**Parameters:**

* ​count (`Int`): The desired number of registers per thread. Must be:
  * A multiple of 8.
  * Between 24 and 256 (inclusive).

---

## weakly_compatible

`weakly_compatible(a: IntTuple[origin], b: IntTuple[origin]) -> Bool`

Test if shape A is weakly compatible with shape B. A shape A is weakly compatible with shape B if there exists a shape C congruent to A such that compatible(elem\_scale(A,C), B). This establishes a partial order relation between shapes where `A <= B`.

**Args:**

* ​a (`IntTuple[origin]`): The first `IntTuple` to compare.
* ​b (`IntTuple[origin]`): The second `IntTuple` to compare.

**Returns:**

True if shape A is weakly compatible with shape B, False otherwise.

---

## weakly_congruent

`weakly_congruent(a: IntTuple[origin], b: IntTuple[origin]) -> Bool`

Test if two IntTuples have similar hierarchical structures. This function establishes a partial order relation between IntTuples based on their hierarchical structure. It's less strict than congruent.

**Args:**

* ​a (`IntTuple[origin]`): First IntTuple to compare.
* ​b (`IntTuple[origin]`): Second IntTuple to compare.

**Returns:**

True if a's structure is compatible with b's structure, False otherwise.

---

## Weight

## `Weight` {#max.graph.Weight}

> *class* max.graph.Weight(\*args, \*\*kwargs)

Bases: [`TensorValue`](TensorValue.md#max.graph.TensorValue)

Represents a value in a Graph that can be loaded at a later time.

Weights can be initialized outside of a Graph and are lazily-added to the parent graph when used. If there is no parent graph when a weight is used, an error will be raised.

Value is abstract; it shouldn't be constructed directly.

### `align` {#max.graph.Weight.align}

> align\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\*

### `device` {#max.graph.Weight.device}

> *property* device\*: DeviceRef\*

Returns the device of the TensorValue.

### `dtype` {#max.graph.Weight.dtype}

> *property* dtype\*: [DType](../dtype.md#max.dtype.DType)\*

Returns the tensor data type.

The following example demonstrates how to access the data type of a tensor:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("dtype_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Access tensor data type
    print(f"Data type: {tensor.dtype}")  # Output: DType.float32
```

### `original_dtype_and_shape` {#max.graph.Weight.original_dtype_and_shape}

> *property* original\_dtype\_and\_shape\*: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[DType](../dtype.md#max.dtype.DType), [Shape](type.md#max.graph.type.Shape)]\*

The original dtype and shape of this weight. This property should be used to store the original weight's dtype and shape if the quantization encoding forces the weight to be loaded as uint8.
### `quantization_encoding` {#max.graph.Weight.quantization_encoding}

> quantization\_encoding\*: [QuantizationEncoding](quantization.md#max.graph.quantization.QuantizationEncoding) | [None](https://docs.python.org/3/library/constants.html#None)\*

### `set_sharding_strategy()` {#max.graph.Weight.set_sharding_strategy}

> set\_sharding\_strategy(sharding\_strategy)

Set the weight sharding strategy.

**Parameters:**

**sharding\_strategy** (`ShardingStrategy` ) – A callable that takes the host weight and shard index, and returns the sharded value.

**Return type:**

None

### `shape` {#max.graph.Weight.shape}

> *property* shape\*: [Shape](type.md#max.graph.type.Shape)\*

Returns the shape of the [`TensorValue`](TensorValue.md#max.graph.TensorValue).

The following example demonstrates how to access the shape of a tensor:

```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops

# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)

# Create a Graph context to work with tensors
with Graph("shape_demo") as graph:
    # Create a constant tensor from the matrix
    tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())

    # Access tensor shape
    print(f"Shape: {tensor.shape}")  # Shape: [Dim(2), Dim(2)]
```

### `shard()` {#max.graph.Weight.shard}

> shard(shard\_idx, device)

Gets a specific shard from the Weight.

This Weight must have sharding\_strategy defined. The shard object returned is also a Weight object, but cannot be sharded further.

**Parameters:**

* **shard\_idx** ([`int`](https://docs.python.org/3/library/functions.html#int) ) – The index of the shard.
* **device** (`DeviceRef` ) – The device to place the shard on.

**Returns:**

The sharded weight.

**Return type:**

[*Weight*](#max.graph.Weight)

### `shard_idx` {#max.graph.Weight.shard_idx}

> shard\_idx\*: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)\*

### `sharding_strategy` {#max.graph.Weight.sharding_strategy}

> sharding\_strategy\*: \_ShardingStrategyContainer | [None](https://docs.python.org/3/library/constants.html#None)\*

---

## Weighted2DPoint

`@register_passable(trivial)`

`struct Weighted2DPoint[type: DType]`

Utility class to wrap 2-d point coordinates and a floating point weight for bilinear interpolation.
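To make the role of the weight concrete, here is a minimal sketch (the names and values are illustrative, and it doesn't construct `Weighted2DPoint` itself) of how bilinear interpolation decomposes one fractional sample point into four integer points with weights, which is exactly the (y, x, w) data this struct wraps:

```mojo
fn main():
    # Fractional sample coordinates.
    var y = 1.3
    var x = 2.6

    # Top-left integer corner (truncation toward zero).
    var y0 = Int(y)  # 1
    var x0 = Int(x)  # 2

    # Fractional offsets within the cell.
    var dy = y - Float64(y0)  # 0.3
    var dx = x - Float64(x0)  # 0.6

    # Bilinear weights for the four surrounding points.
    var w00 = (1.0 - dy) * (1.0 - dx)  # weight for (y0,     x0)
    var w01 = (1.0 - dy) * dx          # weight for (y0,     x0 + 1)
    var w10 = dy * (1.0 - dx)          # weight for (y0 + 1, x0)
    var w11 = dy * dx                  # weight for (y0 + 1, x0 + 1)

    # The four weights sum to 1 (up to floating-point rounding).
    print(w00 + w01 + w10 + w11)
```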
## Fields * ​y (`Int`): * ​x (`Int`): * ​w (`SIMD[type, 1]`): ## Implemented traits `AnyType`, `Copyable`, `Movable`, `UnknownDestructibility` ## Methods ### `__init__` `__init__(y: Int, x: Int, weight: SIMD[type, 1]) -> Self` --- ## welford_block_all_reduce `welford_block_all_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_combine `welford_combine[type: DType, //](mean: SIMD[type, 1], m2: SIMD[type, 1], count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_update `welford_update[type: DType, //](val: SIMD[type, 1], mut mean: SIMD[type, 1], mut m2: SIMD[type, 1], mut count: SIMD[type, 1])` --- ## welford_warp_all_reduce `welford_warp_all_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## welford_warp_reduce `welford_warp_reduce[type: DType, //](thread_mean: SIMD[type, 1], thread_m2: SIMD[type, 1], thread_count: SIMD[type, 1], mut res_mean: SIMD[type, 1], mut res_m2: SIMD[type, 1], mut res_count: SIMD[type, 1])` --- ## wgmma_async `wgmma_async[m: Int, n: Int, k: Int, c_dtype: DType, width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_desc: WGMMADescriptor[dtype], mat_b_desc: WGMMADescriptor[dtype], c_reg: StaticTuple[SIMD[c_dtype, 1], width]) -> StaticTuple[SIMD[c_dtype, 1], width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. This function executes an asynchronous matrix multiplication using warp group MMA instructions. It supports various data types including tensor float32, bfloat16, float16, float8, int8, and uint8. **Constraints:** * The number of output registers must match the instruction shape: `(m * n // 128) * sizeof(accum_type) == width * sizeof(c_dtype)`. * Data type combinations must be compatible with hardware WGMMA instructions. **Parameters:** * ​m (`Int`): Number of rows in matrix A and output matrix. * ​n (`Int`): Number of columns in matrix B and output matrix. * ​k (`Int`): Number of columns in matrix A / rows in matrix B. * ​c\_dtype (`DType`): Data type of the output matrix C. * ​width (`Int`): Width of the InlineArray register for matrix C. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Accumulation data type (defaults to c\_dtype). * ​layout\_a (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix A ("row" or "col"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix B ("row" or "col"). * ​scale\_d (`Int`): Scale factor for matrix C. * ​scale\_a (`Int`): Scale factor for matrix A. * ​scale\_b (`Int`): Scale factor for matrix B. **Args:** * ​mat\_a\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix A. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix B. * ​c\_reg (`StaticTuple[SIMD[c_dtype, 1], width]`): StaticTuple containing matrix C values. **Returns:** `StaticTuple` containing the result of the matrix multiplication. 
`wgmma_async[m: Int, n: Int, k: Int, c_dtype: DType, width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_desc: WGMMADescriptor[dtype], mat_b_desc: WGMMADescriptor[dtype], c_reg: SIMD[c_dtype, width]) -> SIMD[c_dtype, width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. This function executes an asynchronous matrix multiplication using warp group MMA instructions. It supports various data types including tensor float32, bfloat16, float16, float8, int8, and uint8. **Constraints:** * The number of output registers must match the instruction shape: `(m * n // 128) * sizeof(accum_type) == width * sizeof(c_dtype)`. * Data type combinations must be compatible with hardware WGMMA instructions. **Parameters:** * ​m (`Int`): Number of rows in matrix A and output matrix. * ​n (`Int`): Number of columns in matrix B and output matrix. * ​k (`Int`): Number of columns in matrix A / rows in matrix B. * ​c\_dtype (`DType`): Data type of the output matrix C. * ​width (`Int`): Width of the SIMD register for matrix C. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Accumulation data type (defaults to c\_dtype). * ​layout\_a (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix A ("row" or "col"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Memory layout for matrix B ("row" or "col"). * ​scale\_d (`Int`): Scale factor for matrix C. * ​scale\_a (`Int`): Scale factor for matrix A. * ​scale\_b (`Int`): Scale factor for matrix B. **Args:** * ​mat\_a\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix A. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): WGMMA descriptor for matrix B. * ​c\_reg (`SIMD[c_dtype, width]`): SIMD register containing matrix C values. **Returns:** SIMD register containing the result of the matrix multiplication. `wgmma_async[m: Int, n: Int, k: Int, a_dtype: DType, c_dtype: DType, frag_a_width: Int, frag_c_width: Int, /, *, a_type: DType, b_type: DType, accum_type: DType = c_dtype, layout_a: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("row"), layout_b: StringSlice[StaticConstantOrigin] = __init__[__mlir_type.!kgen.string]("col"), scale_d: Int = 1, scale_a: Int = 1, scale_b: Int = 1](mat_a_frag: SIMD[a_dtype, frag_a_width], mat_b_desc: WGMMADescriptor[dtype], c: SIMD[c_dtype, frag_c_width]) -> SIMD[c_dtype, frag_c_width]` Performs warp group async Matrix-multiply and accumulate (WGMMA) operation. Currently only supports: * m=64, k=16. * BF16 input types. * FP32 accumulation. * Row major matrix A. * Column major matrix B (or row major for BF16). **Parameters:** * ​m (`Int`): Number of rows in output matrix. * ​n (`Int`): Number of columns in output matrix. * ​k (`Int`): Inner dimension for matrix multiplication. * ​a\_dtype (`DType`): Data type of matrix A fragment. * ​c\_dtype (`DType`): Data type of output matrix C. * ​frag\_a\_width (`Int`): Width of matrix A fragment. * ​frag\_c\_width (`Int`): Width of output matrix C fragment. * ​a\_type (`DType`): Data type of matrix A. * ​b\_type (`DType`): Data type of matrix B. * ​accum\_type (`DType`): Data type used for accumulation (defaults to c\_dtype). 
* ​layout\_a (`StringSlice[StaticConstantOrigin]`): Layout of matrix A ("row" or "col", defaults to "row"). * ​layout\_b (`StringSlice[StaticConstantOrigin]`): Layout of matrix B ("row" or "col", defaults to "col"). * ​scale\_d (`Int`): Scale factor for output matrix C (defaults to 1). * ​scale\_a (`Int`): Scale factor for matrix A (defaults to 1). * ​scale\_b (`Int`): Scale factor for matrix B (defaults to 1). **Args:** * ​mat\_a\_frag (`SIMD[a_dtype, frag_a_width]`): Fragment containing matrix A data. * ​mat\_b\_desc (`WGMMADescriptor[dtype]`): Descriptor for matrix B data. * ​c (`SIMD[c_dtype, frag_c_width]`): Fragment containing matrix C data. **Returns:** Updated matrix C fragment after WGMMA operation. --- ## wgmma_c_layout `wgmma_c_layout[mma_m: Int, mma_n: Int, C: Layout]() -> List[Layout]` Generates three layouts for mapping WGMMA C matrix coordinates. This function creates three layout mappings that are essential for working with WGMMA (Warp Group Matrix Multiply-Accumulate) operations: 1. A projection layout that maps linearized indices to row coordinates (i) 2. A projection layout that maps linearized indices to column coordinates (j) 3. A composite layout that maps thread and vector coordinates to linearized indices across multiple MMA tiles These layouts are particularly useful for operations like attention masking and matrix multiplication epilogues, where register values need to be mapped to the coordinate system of the C matrix. Note: This function enforces constraints on the WGMMA dimensions and ensures the C matrix dimensions are compatible with the WGMMA instruction size. **Parameters:** * ​mma\_m (`Int`): The M dimension (rows) of a single WGMMA instruction, must be 64. * ​mma\_n (`Int`): The N dimension (columns) of a single WGMMA instruction, must be multiple of 8. * ​C (`Layout`): The layout of the C matrix within a thread block. **Returns:** `List[Layout]` - A list containing three layouts: 1. proj\_i: Maps linearized indices to row coordinates 2. proj\_j: Maps linearized indices to column coordinates 3. TV\_tile\_to\_idx: Maps thread/vector/tile coordinates to linearized indices --- ## wgmma_c_thread_layout `wgmma_c_thread_layout[C: Layout]() -> Layout` Returns the thread layout component for WGMMA C matrix. Generates the first mode of the WGMMA C layout, which maps thread coordinates to linearized indices in the output matrix. **Parameters:** * ​C (`Layout`): The layout of the C matrix. **Returns:** `Layout` - A layout mapping thread coordinates to linearized indices. --- ## wgmma_commit_group_sync `wgmma_commit_group_sync()` Commits pending warp group matrix multiply operations. This synchronizes the warp group and ensures all WGMMA operations have been committed. Must be called after a sequence of WGMMA operations before accessing results. --- ## wgmma_fence_aligned `wgmma_fence_aligned()` Inserts a memory fence for warp group matrix multiply operations. This ensures all prior shared memory accesses are visible before subsequent WGMMA operations. Must be called before starting a new sequence of WGMMA operations. --- ## wgmma_output_layout `wgmma_output_layout[mma_n: Int, C: Layout]() -> Layout` Returns the output layout component for WGMMA C matrix. Generates the second mode of the WGMMA C layout, which maps output vector coordinates to linearized indices in the output matrix. **Parameters:** * ​mma\_n (`Int`): The N dimension of the WGMMA instruction. * ​C (`Layout`): The layout of the C matrix. 
**Returns:**

`Layout` - A layout mapping output vector coordinates to linearized indices.

---

## wgmma_wait_group_sync

`wgmma_wait_group_sync[group: Int = 0]()`

Waits for pending warp group matrix multiply operations to complete.

This synchronizes the warp group and ensures the WGMMA operations have finished executing. Must be called after commit and before accessing results.

**Parameters:**

* group (`Int`): The maximum number of WGMMA groups that may remain pending after the wait.

---

## WGMMADescriptor

`@register_passable(trivial)`

`struct WGMMADescriptor[dtype: DType]`

Descriptor for shared memory operands used in warp group matrix multiply operations.

This struct represents a descriptor that encodes information about shared memory layout and access patterns for warp group matrix multiply operations. The descriptor contains the following bit fields:

* Start address (14 bits): Base address in shared memory.
* Leading byte offset (14 bits): Leading dimension stride in bytes.
* Stride byte offset (14 bits): Stride dimension offset in bytes.
* Base offset (3 bits): Additional offset.
* Swizzle mode (2 bits): Memory access pattern.

The bit layout is:

```
+----------+--------+------------+--------+---------+--------+--------+---------+---------+
|   0-13   | 14-15  |   16-29    | 30-31  |  32-45  | 46-48  | 49-51  |  52-61  |  62-63  |
+----------+--------+------------+--------+---------+--------+--------+---------+---------+
| 14 bits  | 2 bits |  14 bits   | 2 bits | 14 bits | 3 bits | 3 bits | 10 bits | 2 bits  |
+----------+--------+------------+--------+---------+--------+--------+---------+---------+
| BaseAddr |   0    | LeadingDim |   0    | Stride  |   0    | Offset |    0    | Swizzle |
+----------+--------+------------+--------+---------+--------+--------+---------+---------+
```

## Parameters

* dtype (`DType`): The data type of the shared memory operand. This affects memory alignment and access patterns for the descriptor.

## Fields

* desc (`SIMD[int64, 1]`): The 64-bit descriptor value that encodes shared memory layout information. This field stores the complete descriptor with all bit fields packed into a single 64-bit integer:
  * Bits 0-13: Base address in shared memory (14 bits)
  * Bits 16-29: Leading dimension stride in bytes (14 bits)
  * Bits 32-45: Stride dimension offset in bytes (14 bits)
  * Bits 49-51: Base offset (3 bits)
  * Bits 62-63: Swizzle mode for memory access pattern (2 bits)

The descriptor is used by the NVIDIA Hopper architecture's warp group matrix multiply instructions to efficiently access shared memory with the appropriate layout and access patterns.

## Implemented traits

`AnyType`, `Copyable`, `Movable`, `UnknownDestructibility`

## Methods

### `__init__`

`@implicit`
`__init__(val: SIMD[int64, 1]) -> Self`

Initializes the descriptor with a raw 64-bit value.

This constructor allows creating a descriptor directly from a 64-bit integer that already contains the properly formatted bit fields for the descriptor. The `@implicit` attribute enables automatic conversion from `Int64` to `WGMMADescriptor`.

**Args:**

* val (`SIMD[int64, 1]`): A 64-bit integer containing the complete descriptor bit layout.

### `__add__`

`__add__(self, offset: Int) -> Self`

Adds an offset to the descriptor's base address.

**Args:**

* offset (`Int`): Byte offset to add to the base address.

**Returns:**

A new descriptor with the updated base address.

### `__iadd__`

`__iadd__(mut self, offset: Int)`

Adds an offset to the descriptor's base address in place.

**Args:**

* offset (`Int`): Byte offset to add to the base address.
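Taken together, the functions above follow a fixed fence/issue/commit/wait sequence, and the descriptor arithmetic shown for `__add__` and `__iadd__` is what advances the shared memory operands between instructions. The following is a minimal, hypothetical sketch of that sequence for a BF16 64x64x16 tile. The tile count and byte offsets are illustrative assumptions, and a real kernel would first build `a_desc` and `b_desc` with the `create()` method documented below:

```mojo
# Hypothetical sketch of a K-tiled WGMMA accumulation; assumes a_desc and
# b_desc are WGMMADescriptor[DType.bfloat16] values for shared-memory tiles.
alias M = 64
alias N = 64
alias K = 16
alias width = (M * N) // 128  # satisfies the register-count constraint above

var c_reg = SIMD[DType.float32, width](0)

# Fence before starting a new sequence of WGMMA operations.
wgmma_fence_aligned()

@parameter
for i in range(4):  # 4 K-tiles; illustrative count
    c_reg = wgmma_async[
        M, N, K,
        DType.float32, width,
        a_type = DType.bfloat16,
        b_type = DType.bfloat16,
    ](a_desc + i * 2048, b_desc + i * 2048, c_reg)  # 2048-byte step is illustrative

# Commit the issued WGMMA operations, then wait for them to finish
# before reading c_reg.
wgmma_commit_group_sync()
wgmma_wait_group_sync()
```

Because the operation is asynchronous, work issued between the commit and the wait is free to overlap with the matrix multiplications.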
### `create`

`static create[stride_byte_offset: Int, leading_byte_offset: Int, swizzle_mode: TensorMapSwizzle = TensorMapSwizzle(__init__[__mlir_type.!pop.int_literal](0))](smem_ptr: UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]) -> Self`

Creates a descriptor for a shared memory operand.

**Parameters:**

* stride\_byte\_offset (`Int`): Stride dimension offset in bytes.
* leading\_byte\_offset (`Int`): Leading dimension stride in bytes.
* swizzle\_mode (`TensorMapSwizzle`): Memory access pattern mode.

**Args:**

* smem\_ptr (`UnsafePointer[SIMD[dtype, 1], address_space=AddressSpace(3)]`): Pointer to the shared memory operand.

**Returns:**

An initialized descriptor for the shared memory operand.

---

## What is Modular

The Modular Platform is an open and fully-integrated suite of AI libraries and tools that accelerates model serving and scales GenAI deployments. It abstracts away hardware complexity so you can run the most popular open models with industry-leading GPU and CPU performance without any code changes. Our ready-to-deploy Docker container removes the complexity of deploying your own GenAI endpoint.

And unlike other serving solutions, Modular enables customization across the entire stack. You can customize everything from the serving pipeline and model architecture all the way down to the metal by writing custom ops and GPU kernels in Mojo. Most importantly, Modular is hardware-agnostic and free from vendor lock-in—no CUDA required—so your code runs seamlessly across diverse systems.

It takes only a moment to start an OpenAI-compatible endpoint with a model from Hugging Face:

```sh
max serve --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
```

```sh
docker run --gpus=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  docker.modular.com/modular/max-nvidia-full:latest \
  --model-path=modularai/Llama-3.1-8B-Instruct-GGUF
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",
    messages=[
        {
            "role": "user",
            "content": "Write a one-sentence bedtime story about a unicorn.",
        },
    ],
)

print(completion.choices[0].message.content)
```

## Capabilities

- [x] **High-performance, portable serving**: Serve 500+ AI models from Hugging Face using our OpenAI-compatible REST API with industry-leading performance across GPUs and CPUs.
- [x] **Large-scale GenAI deployment**: Scale massive GenAI inference services across thousands of GPU nodes. Modular intelligently routes workloads across models and hardware types to maximize throughput and minimize latency.
- [x] **Flexible, faster development**: Deploy with a single Docker container that's under 1GB across multiple hardware types, compile in seconds rather than hours, and develop faster with a slim toolchain that makes versioning and dependency nightmares disappear.
- [x] **Customize everywhere**: Customize at any layer of the stack by writing hardware-agnostic GPU and CPU kernels, porting models into Modular's optimized graph format, or programming hardware directly with Mojo (no hardware-specific libraries).
## Components

Modular is a vertically integrated AI infrastructure stack that spans from the hardware all the way up to Kubernetes, and it provides entry points for users at every level.

Figure 1. A simplified diagram of how the Modular Platform scales your GenAI deployment.

- 🏔️ **Mammoth**: A Kubernetes-native control plane, router, and substrate specially designed for large-scale distributed AI serving. It supports multi-model management, prefill-aware routing, disaggregated compute and cache, and other advanced AI optimizations.
- 🧑🏻‍🚀 **MAX**: A high-performance AI serving framework that includes advanced model serving optimizations like speculative decoding, and graph compiler optimizations like op-level fusions. It provides an OpenAI-compatible serving endpoint, executes native MAX and PyTorch models across GPUs and CPUs, and can be customized at the model and kernel level.
- 🔥 **Mojo**: A kernel-focused systems programming language that enables high-performance GPU and CPU programming, blending Pythonic syntax with the performance of C/C++ and the safety of Rust. All the kernels in MAX are written in Mojo, and you can use it to extend MAX models with novel algorithms.

## Get started

You can create an OpenAI-compatible REST endpoint using our `max` CLI or our Docker container:

- [**Start with pip**](/max/get-started): Install MAX with `pip` and run inference with Python or a REST endpoint.
- [**Start with Docker**](/max/container): Run our Docker container to create a REST endpoint.

In either case, you can select from hundreds of GenAI models in our [Model repository](https://builds.modular.com/?category=models). You can also load weights from Hugging Face or load your own fine-tuned weights.

For performance optimization, you can port models from PyTorch to MAX using the [MAX Graph API](/max/tutorials/get-started-with-max-graph-in-python). For deeper customization, you can extend MAX models with [custom operations](/max/tutorials/build-custom-ops) (ops) written in Mojo. Your custom ops are automatically analyzed and fused into the model graph, delivering low-level acceleration without sacrificing developer productivity.

:::note Get early access
Mammoth is not yet generally available, but enterprise customers can get early access. [Contact us now](https://www.modular.com/company/talk-to-us)
:::

---

## What's new

Here's everything you should know about what's changed in each release.

## v25.4 nightly

This version is still a work in progress. See how to [install the nightly release](/max/packages#nightly-release).

### Documentation {#25-4-docs}

* Added instructions on profiling MAX kernels (see `max/kernels/README.md`).

### MAX models {#25-4-models}

* GGUF-quantized Llamas (q4\_0, q4\_k, and q6\_k) are now supported with the paged KVCache strategy.

### MAX framework {#25-4-max}

#### Serving & inference engine {#25-4-max-serving}

* In-flight batching no longer requires chunked prefill.
* The naive KVCache has been deleted.
* Removed support for TorchScript and Torch-MLIR models.
* The continuous KVCache strategy is deprecated. Please use the paged KVCache strategy instead.

#### `max` CLI {#25-4-max-cli}

* Added a `--use-subgraphs` flag to `max generate` to allow for the use of subgraphs in the model.

#### Python APIs {#25-4-max-python}

* Added an `add_subgraph` method to the `Graph` class. This method allows for the addition of a subgraph to a graph.
* Added the `call` operation, which allows for the execution of a subgraph.
* Added the `fold` op for combining sliding blocks into a larger tensor.
* Removed the server setting from the `llm.py` entrypoint for offline inference. The server is now configured automatically in the background without consuming an HTTP port.
* Added a `strict` parameter to the `load_state_dict` method in `max.nn.Module`. When `strict=True` (the default), an error is raised if the `state_dict` contains unused keys. When `strict=False`, extra keys are ignored. This helps model developers identify missing implementations in their models.
* Added the new `max.torch` module for using custom Mojo kernels from PyTorch. This module replaces the previously deprecated `max.torch` module. For example, a custom `grayscale` operation

  ```mojo
  @register("grayscale")
  struct Grayscale:
      @staticmethod
      fn execute[
          # The kind of device this is running on: "cpu" or "gpu"
          target: StaticString,
      ](
          img_out: OutputTensor[type = DType.uint8, rank=2],
          img_in: InputTensor[type = DType.uint8, rank=3],
          ctx: DeviceContextPtr,
      ) raises:
          ...
  ```

  can be used from PyTorch like so:

  ```python
  import torch

  from max.torch import CustomOpLibrary

  op_library = CustomOpLibrary("path/to/custom.mojopkg")

  @torch.compile(backend=backend)  # `backend` is defined elsewhere in the example
  def grayscale(pic):
      result = pic.new_empty(pic.shape[:-1])
      op_library.grayscale(result, pic)
      return result

  img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
  result = grayscale(img)
  ```

  See [whisper.py](https://github.com/modularml/modular/blob/main/open-source/max/examples/custom_ops/whisper.py) for a larger example, which replaces the attention module with one using a custom fused attention operation implemented in Mojo.

* Removed `graph.unique_symbolic_dim`.
* `ops.masked_scatter` now requires naming the `out_dim` explicitly, as it is data-dependent. For example:

  ```python
  ops.masked_scatter(
      inputs_embeds, video_mask, video_embeds, out_dim="unmasked_inputs"
  )
  ```

* Removed `max_to_torch_type` and `torch_to_max_type` and replaced them with `DType.to_torch` and `DType.from_torch`, respectively. This aligns with the corresponding NumPy methods.

#### Mojo APIs {#25-4-max-mojo}

* The Mojo Graph, Driver, and Engine APIs have been open sourced and removed from the codebase. In addition, many types from the `max.tensor` package have been removed:
  * `Tensor`
  * `TensorShape`
  * `TensorSpec`

  Please replace usage with `LayoutTensor`.
* `LayoutTensor` now has a `size` method to get the total number of elements.
* `List`, `InlineArray`, `IntTuple`, and `IndexList` now work with list literals.

#### Custom ops {#25-4-custom-ops}

* Improved error messages when custom op parameters are provided with values that don't have the proper type.

### Mojo language {#25-4-mojo}

* Various packages pertaining to the MAX Kernel Library are now shipped in the nightly release. You can now find `.mojopkg` files in the SDK for the following new packages:
  * `linalg`
  * `nn`
  * `nvml`
  * `quantization`
  * `weights_registry`

## v25.3 (2025-05-06)

* [Highlights](#25-3-highlights)
* [Documentation](#25-3-docs)
* [`max` CLI](#25-3-max-cli)
* [MAX models](#25-3-models)
* [MAX Serve](#25-3-serve)
* [MAX Engine & Graph](#25-3-engine)
* [Python API](#25-3-engine-python-api)
* [Mojo API](#25-3-engine-mojo-api)
* [Custom ops](#25-3-custom-ops)
* [Kernels](#25-3-kernels)
* [GPU programming](#25-3-gpu-programming)
* [Mojo language](#25-3-mojo)

### ✨ Highlights {#25-3-highlights}

* You can now **install Modular APIs and tools with pip**:

  ```sh
  pip install modular \
    --index-url https://download.pytorch.org/whl/cpu
  ```

  This installs the `max` CLI, `max` Python library, `mojo` CLI, and Mojo libraries. However, the Mojo LSP and debugger are currently not included.
  If you plan to develop with Mojo, we still suggest using [`magic`](/magic). We use the `--index-url` argument to ensure that `torch` installs its CPU-only dependencies, avoiding a lot of unnecessary GPU packages. This is a temporary workaround until we can remove our dependency on `torch`.

* We **open-sourced the MAX AI kernels** and the rest of the **Mojo standard library**! The [MAX AI kernels library](/mojo/lib#max-ai-kernels-library) is a new Mojo API for writing high-performance, portable programs across CPU and GPU, but it's also [the source code for our CPU/GPU kernels](https://github.com/modular/modular/tree/main/max/kernels/src). You can now see the Mojo code we use in MAX to power GenAI workloads on CPUs and GPUs. Just like the Mojo standard library, these kernels are open source under the Apache 2.0 License with LLVM exceptions. Plus, the rest of the Mojo standard library is also [now open source on GitHub](https://github.com/modular/modular/tree/main/mojo/stdlib/src).

* **Learn to program GPUs** with [Mojo GPU Puzzles](https://builds.modular.com/puzzles)! This brand-new site offers a hands-on guide to mastering GPU programming with Mojo. Starting from basic concepts, you'll learn step by step how to program GPUs by solving increasingly challenging puzzles.

### Documentation {#25-3-docs}

We've restructured the documentation to unify the MAX and Mojo documentation under the Modular Platform. We believe this improves content discovery with simplified navigation and helps unify the platform story as a whole.

We've also added the following new docs:

* [REST API reference](/max/api/serve): Although it's not a new API (our serving library has supported the OpenAI APIs for the last few versions), this now shows precisely which endpoints and body parameters we support.
* [Speculative decoding](/max/serve/speculative-decoding): An introduction to using speculative decoding to reduce latency for LLMs. This feature is still in development.
* [Offline inference](/max/serve/offline-inference): An introduction to our Python API for running inference with an LLM locally (without sending requests to a serving endpoint).
* [Introduction to layouts](/mojo/manual/layout/layouts): A guide to working with dense multidimensional arrays on CPUs and GPUs, using new Mojo `layout` types that abstract away complex memory layout patterns.

### `max` CLI {#25-3-max-cli}

* Renamed the `max-pipelines` CLI tool to `max`. We recommend re-installing it as shown in the [`max` CLI docs](/max/max-cli/).
* Removed the previously deprecated `--use-gpu`, `--serialized_model_path`, `--save_to_serialized_model_path`, `--max_cache_batch_size`, and `--huggingface-repo-id` options.
* Moved `InputContext`, `TextContext`, and `TextAndVisionContext` from `max.pipelines` to `max.pipelines.context`.

### MAX models {#25-3-models}

* Added `Llama4ForConditionalGeneration` support, featuring new MoE layers. Currently, it is limited to text inputs. Run the model by calling:

  ```sh
  max generate --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --devices 0,1,2,3
  ```

* Added support for running text generation with the Mistral 3 24B model. Run the model with:

  ```sh
  max generate --model-path mistralai/Mistral-Small-3.1-24B-Instruct-2503 --devices 0
  ```

* Fixed empty textual outputs for certain Mistral models ([MAX issue 4193](https://github.com/modular/modular/issues/4193)).
* Added support for loading a custom pipeline architecture by module.
  Passing `--custom-architectures=folder/path/to/import:my_module` loads architectures from that module. The architectures must be exposed via an `ARCHITECTURES` variable in the file. Once loaded, a model can be run using the new architectures. The flag can be specified multiple times to load additional modules.

### MAX Serve {#25-3-serve}

* Moved from a radix trie to a hash-based prefix caching implementation, which has lower CPU overhead. This improves performance, particularly in workloads with high cache reuse rates.
* Added experimental support for offloading the KVCache to host memory via the `--enable-kvcache-swapping-to-host` and `--host-kvcache-swap-space-gb` flags. This allows for superior KVCache reuse through prefix caching in workloads where the reusable KVCache amount exceeds GPU VRAM.
* Fixed the `usage.prompt_tokens` field in the OpenAI API usage info response. Previously this field was always set to null, but now it correctly contains the number of prompt tokens in the request.
* Switched from the Python multiprocessing queue to ZeroMQ. This reduces networking-related latency between the frontend server process and the model worker process.
* Stray model workers on Linux now terminate more reliably when the parent process is killed.

### MAX Engine & Graph {#25-3-engine}

#### Python API {#25-3-engine-python-api}

* We now raise an error if there's a mismatch between the expected device of a weight on a graph and the device of the actual tensor data specified in [`InferenceSession.load()`](/max/api/python/engine#max.engine.InferenceSession.load).
* Removed the `output_device` argument from [`Model.execute()`](/max/api/python/engine#max.engine.Model.execute).
* Removed the `copy_inputs_to_device` argument in [`Model.execute`](/max/api/python/engine#max.engine.Model.execute) to improve the predictability of the API. Now `execute()` raises a `TypeError` if arguments are passed whose devices don't match the model.
* Swapped the order of the `dtype` and `shape` fields of [`driver.Tensor`](/max/api/python/driver#max.driver.Tensor). Previously, the arguments were ordered as `(shape, dtype)`. They are now `(dtype, shape)`, in line with other tensor-like types.
* Replaced some instances of [`Tensor.zeros`](/max/api/python/driver#max.driver.Tensor.zeros) with `Tensor.__init__` when the engine did not depend on the tensor being zero-initialized. This elides an unnecessary memset, providing a minor performance improvement.
* Added the new experimental [`Tensor.inplace_copy_from()`](/max/api/python/driver#max.driver.Tensor.inplace_copy_from), which lets you copy the contents of one `Tensor` into another.
* Changed the default behavior of [`Weight`](/max/api/python/graph/Weight) to expect the initial allocation on the host. A transfer to the target device is then inserted, and that value is returned when a weight generates an MLIR value. This reflects the currently conservative ownership model for external weights.
* Added the [`irfft`](/max/api/python/graph/ops/#max.graph.ops.irfft) op, which computes the inverse real fast Fourier transform (FFT).
* Added the [`argmax`](/max/api/python/graph/ops#max.graph.ops.argmax) op, which returns the index of the maximum value in an array or sequence.
* Added the [`GroupNorm`](/max/api/python/nn/norm/group_norm) layer.
* Switched layer names so that `max.nn` layers implemented with the deprecated `Layer` class are marked as "V1", and layers implemented with the new [`max.nn.Module`](/max/api/python/nn/layer#max.nn.layer.Module) are the default. That is, `max.nn.LinearV2` is now [`max.nn.Linear`](/max/api/python/nn/linear#max.nn.linear.Linear), and the previous `max.nn.Linear` is now [`max.nn.LinearV1`](/max/api/python/nn/linear#max.nn.linear.LinearV1).
* `DeviceRef` values in types and layers are now generally expected to be explicit rather than implicit.

#### Mojo API {#25-3-engine-mojo-api}

* Removed some functionality from [`tensor.Tensor`](/max/api/mojo/tensor/tensor/Tensor):
  * Serializing a `Tensor` to disk (`Tensor.tofile(path)` and `Tensor.save(path)`).
  * Reading the serialized data back from disk (`Tensor.load(path)` and `Tensor.fromfile(path)`).
  * The `rand` and `randn` methods have been removed. Use the equivalents in the Mojo standard library if you still need to construct a new `Tensor` with random elements based on a particular `TensorShape`.
* **Deprecated the Mojo Driver, Graph, and Engine APIs.** These APIs are not currently used internally. Instead, we build graphs using the Python APIs, and our engineering efforts have been focused on making that experience as robust and user-friendly as possible. As a result, the Mojo versions of these APIs have not kept pace with new features and language improvements. These APIs will be open sourced for the community before being removed.

#### Custom ops API {#25-3-custom-ops}

* You can now pass Mojo source package paths as [`Graph`](/max/api/python/graph/Graph) custom extensions. The Mojo code is compiled automatically; there's no need to run `mojo package` manually as a prior step. Previously, only pre-compiled `.mojopkg` paths were accepted, requiring the Mojo code to be built as a prerequisite step before running a `Graph` with a custom op.

  Given a project structure like:

  ```text
  project
  |-- main.py
  \-- kernels
      |-- __init__.mojo
      \-- my_custom_op.mojo
  ```

  you can construct a `Graph` in `main.py` using Mojo custom op kernels simply by using:

  ```python
  g = Graph(
      ...,
      custom_extensions=[Path(__file__).parent / "kernels"],
  )
  ```

  A change to your Mojo source code defining a custom op will be reflected immediately the next time the `Graph` is constructed.

* New [image\_pipeline example](https://github.com/modular/modular/tree/main/examples/custom_ops) that demonstrates sequencing custom ops that modify an image, leaving the data on the GPU for each op before writing it back to the CPU and disk.

### Kernels {#25-3-kernels}

* More compute overlap is now enabled for Hopper GPUs. This allows finer-grained scheduling of kernel operations by analyzing producer-consumer patterns within a compute kernel. As a result, there is more kernel compute overlap, especially for compute-heavy kernels with data-dependent execution paths.

### GPU programming {#25-3-gpu-programming}

* The CUDA driver requirement has been reduced to version 12.4, and the NVIDIA driver requirement to version 550. Supporting these earlier driver versions allows MAX to be more easily deployed on AWS and GCP, since these are the default versions used by those cloud providers.
* Added support for programming NVIDIA Jetson Orin GPUs (`sm_87`).

Also see the [Mojo changelog of GPU changes](/mojo/changelog#gpu-changes).

### Mojo language {#25-3-mojo}

* We recently open-sourced the rest of the Mojo standard library, including the `algorithm`, `benchmark`, `buffer`, `compile`, `complex`, `gpu`, and `layout` packages.
  [See it all on GitHub](https://github.com/modular/modular/tree/main/mojo/stdlib/src).

* We've also open sourced [all our MAX AI kernels](https://github.com/modular/modular/tree/main/max/kernels/src). This new library includes `kv_cache`, `layout`, `linalg`, `nn`, `nvml`, and `quantization`.

For all the updates to the Mojo language, standard library, and tools, see the [Mojo changelog](/mojo/changelog).

## v25.2 (2025-03-25)

* [Highlights](#25-2-highlights)
* [MAX Serve](#25-2-serve)
* [MAX models](#25-2-models)
* [`max-pipelines` CLI](#25-2-pipelines-cli)
* [MAX Engine](#25-2-engine)
* [Driver APIs](#25-2-driver)
* [Graph APIs](#25-2-graph)
* [Custom ops](#25-2-custom-ops)
* [Hopper kernels](#25-2-hopper-kernels)
* [GPU programming](#25-2-gpu-programming)
* [Mojo](#25-2-mojo)
* [Documentation](#25-2-documentation)

### ✨ Highlights {#25-2-highlights}

* **Support for NVIDIA Hopper GPUs**

  MAX has been optimized to run on Hopper GPUs. For more information on MAX and NVIDIA hardware, see the [MAX container](/max/container#recommended-cloud-instances) documentation.

* **Multi-GPU support**

  MAX uses tensor parallelism to distribute work across multiple GPUs so you can run LLMs like [`Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), even with long context windows.

* **Expanded library of MAX models**

  We're rapidly growing our library of base model architectures that MAX can accelerate with MAX Serve (including `Phi3ForCausalLM`, `OlmoForCausalLM`, and `GraniteForCausalLM`). We also now support GPTQ for the Llama models. For more information, check out our [MAX model repository](https://builds.modular.com/?category=models).

* **Advanced end-to-end optimizations for long context windows**

  In-flight batching, chunked prefill, and copy-on-write optimize execution for prefix-heavy and long-context-window scenarios.

* **GPU programming with Mojo**

  Lots of new APIs are now available to enable both low-level GPU programming and abstracted programming patterns that simplify the code required to write GPU kernels for your AI models.

### MAX Serve {#25-2-serve}

* Extended MAX Serve batch scheduling to account for the prefix cache. The scheduler can now create larger batches when many prompt tokens are already cached, improving throughput by up to 10% in some benchmarks.
* Added support for in-flight batching, allowing token generation requests to be scheduled alongside context encoding requests to reduce inter-token latency. This behavior can be controlled with the `--enable-in-flight-batch` CLI argument.
* Added support for copy-on-write on KV blocks when using PagedAttention with prefix caching. This improves the prefix cache hit rate and prefill performance in some scenarios.
* MAX Serve now supports `transformers` v4.49.0, with a patch to avoid graph breaks when using `torch.compile()` on Llama models.
* Added support for recording HTTP traffic to a file for diagnostics or later replay.

### MAX models {#25-2-models}

* Added support for executing `LlamaForCausalLM` architecture models on multiple GPUs. The model uses tensor parallelism automatically when you pass multiple device IDs to the `--devices` CLI argument.
  Try running [`meta-llama/Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) on 4 GPUs with the following example:

  ```sh
  max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
    --quantization-encoding bfloat16 \
    --devices gpu:0,1,2,3 \
    --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  ```

* Added support for the `Phi3ForCausalLM` model architecture (such as [`microsoft/phi-4`](https://huggingface.co/microsoft/phi-4)). For example:

  ```sh
  max-pipelines generate \
    --model-path microsoft/phi-4 \
    --prompt "Write bubble sort in mojo"
  ```

* Added support for the `OlmoForCausalLM` model architecture (such as [`allenai/OLMo-1B-0724-hf`](https://huggingface.co/allenai/OLMo-1B-0724-hf)). For example:

  ```sh
  max-pipelines generate \
    --model-path allenai/OLMo-1B-0724-hf \
    --prompt "Write bubble sort in mojo"
  ```

* Added support for the `GraniteForCausalLM` model architecture (such as [`ibm-granite/granite-3.1-8b-instruct`](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)). For example:

  ```sh
  max-pipelines generate \
    --model-path ibm-granite/granite-3.1-8b-instruct \
    --prompt "Write bubble sort in mojo"
  ```

* Added support for:
  * [`microsoft/Phi-3.5-mini-instruct`](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)
  * [`microsoft/phi-4`](https://huggingface.co/microsoft/phi-4)
  * [`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
  * [`LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct`](https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct)

* We now support GPTQ quantization for models that run on the GPU. This is handled transparently when the model weights are specified. For example, this runs Llama 3.1 8B using int4-quantized GPTQ weights:

  ```sh
  max-pipelines generate \
    --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
    --prompt "Why is the sky blue?" \
    --max-batch-size 1 \
    --max-length 10000
  ```

  This reduces the total memory consumption of this model from \~16 GB to \~5 GB, allowing the model to fit in the RAM of smaller GPUs.

* Model weights are now downloaded in parallel.
* Added constraints on whitespace during [structured output](/max/serve/structured-output). This reduces token counts and improves model adherence.
* Added jump-ahead decoding during structured output. This auto-completes tokens when a single path forward is identified, improving single-completion times by up to \~20% for long prompts.
* In the event of an unhandled exception, we now use the standard Python traceback format instead of pretty-printed Rich tracebacks.
* You now need to explicitly import `LLM` from [`max.entrypoints.llm`](/max/api/python/entrypoints) rather than the previous `max.entrypoints` import.
* The `max.pipelines.dataprocessing.tokenizer` and `max.pipelines.dataprocessing.gguf_utils` modules have been removed.
* The previously deprecated `PipelineConfig.architecture` field and its corresponding `--architecture` CLI argument have been removed.

### `max-pipelines` CLI {#25-2-pipelines-cli}

* The `--devices` CLI argument now supports a comma-separated list of GPU IDs prefixed with `gpu:`, like `--devices=gpu:0,1,2,3`. We no longer support the previous `--devices=gpu-` format.
  ```sh
  max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
    --quantization-encoding bfloat16 \
    --devices gpu:0,1,2,3 \
    --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  ```

* Removed the `--huggingface-repo-id` [PipelineConfig](/max/api/python/pipelines/config/#max.pipelines.config.PipelineConfig) option and CLI argument in favor of `--model-path`.
* We consolidated `--model-path` and `--weight-path`. Valid `--weight-path` values now override `--model-path`, which handles both local and remote (Hugging Face) cases. If we cannot derive the weights from the `--weight-path`, we now fall back to the `--model-path`, which you must set explicitly.
* Added the `--huggingface-revision` option to allow selecting a non-default branch or a specific commit in a Hugging Face model repository.

### MAX Engine {#25-2-engine}

* The MAX graph compiler now has kernel caching. This is a significant improvement to our compilation pipeline. Here are some of the highlights:
  * Up to 28% faster compilation times when making iterative changes to models
  * Improved caching between different but similar models (up to 27% faster)
  * Lays the foundation for future caching optimizations

  What does this mean for you? Faster development cycles! When you're working on model pipelines and making changes to the graph, the graph compiler will now intelligently reuse kernels that haven't changed, significantly reducing compilation times.

  The improvements are particularly noticeable during iterative development, with compilation times dropping from \~80s to \~57s in some cases when compiling Llama 3.1-8B for 4 GPUs. Even when compiling different models from the same family (like Llama/Granite variants), you'll see significant speedups on subsequent compilations.

### Driver APIs {#25-2-driver}

* Added the `Accelerator.can_access(other: Device) -> bool` method to check whether one device can directly access the memory of another device.
* Fixed a bug in `max.driver.tensor.load_max_tensor()` for the `bfloat16` dtype, which would cause an error about the mmap size being too large.
* `max.driver.Tensor.item()` now works on any single-element tensor (previously restricted to rank-0 tensors).
* Added [`Device.synchronize()`](/max/api/python/driver#max.driver.Device.synchronize), which ensures all operations on the device complete before returning.
* Removed `MojoCallContextPtr` in favor of `DeviceContextPtr`. `MojoCallContextPtr` only contained a `DeviceContextPtr`, so this change directly exposes the `DeviceContextPtr`.
  Custom ops using `MojoCallContextPtr` now directly take a `DeviceContextPtr` argument:

  ```mojo
  @staticmethod
  fn execute[
      type: DType, rank: Int
  ](
      output: OutputTensor[type=type, rank=rank],
      input: InputTensor[type=type, rank=rank],
      ctx: MojoCallContextPtr,
  ):
  ```

  becomes

  ```mojo
  @staticmethod
  fn execute[
      type: DType, rank: Int
  ](
      output: OutputTensor[type=type, rank=rank],
      input: InputTensor[type=type, rank=rank],
      ctx: DeviceContextPtr,
  ):
  ```

* You can now skip compiling a GPU kernel before enqueueing it, and instead pass a function directly to `ctx.enqueue_function[func](...)`:

  ```mojo
  fn func():
      print("Hello from GPU")

  @register("custom_op")
  struct CustomOp:
      @staticmethod
      fn execute(ctx: DeviceContextPtr) raises:
          var dev_ctx = ctx.get_device_context()
          dev_ctx.enqueue_function[func](grid_dim=1, block_dim=1)
  ```

  However, if you're reusing the same function and parameters multiple times, this incurs an overhead of around 50-500 nanoseconds per enqueue. So you can still compile the function first and pass it to `ctx.enqueue_function` in this scenario:

  ```mojo
  var compiled_func = ctx.compile_function[func]()
  # Multiple kernel launches with the same function/parameters
  ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
  ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
  ```

* Changed `Accelerator` and `CPU` from factory methods that created `Device` objects in Python (which were accelerators and CPUs in the C++ implementation) to actual Python types. This change elevates the `Accelerator` and `CPU` concepts to Python, making them types rather than methods, which allows type annotations in Python. For example, a list of accelerators used to be defined like this:

  ```python
  graph_devices: list[DeviceRef]
  ```

  Now it can be defined like this:

  ```python
  graph_devices: list[Accelerator]
  ```

* Elementwise operations (e.g. `__add__`) have been removed from `Tensor` (that is, `tensor_internal.Tensor`). This `Tensor` type is being phased out; please reduce usage in favor of `LayoutTensor`.

### Graph APIs {#25-2-graph}

* The `nn` package is now [`max.nn`](/max/api/python/nn/).
* Added [`ops.chunk`](/max/api/python/graph#max.graph.ops.chunk) to support chunking tensors along an axis.
* Added support for while loops with [`ops.while_loop`](/max/api/python/graph#max.graph.ops.while_loop).
* Added support for conditional execution with [`ops.cond`](/max/api/python/graph#max.graph.ops.cond).
* Added axis reduction overloads for [`ops.min`](/max/api/python/graph/ops#max.graph.ops.min) and [`ops.max`](/max/api/python/graph/ops#max.graph.ops.max). For example: `ops.min(tensor, axis=-1)`.
* The [`gelu()`](/max/api/python/graph/ops#max.graph.ops.gelu) function now accepts an `approximate` keyword, which controls the `gelu` approximation; the `none`, `tanh`, and `fast` approximations are accepted.
* Removed the `roundeven()` operation from the Python API. The [`round()`](/max/api/python/graph/ops#max.graph.ops.round) operation now has the same behavior as `roundeven()`, so there is no need for both to exist.
* Added helpers to create analogous tensors from buffer types and vice versa.
* Added `max.nn.Module`, a base class for writing layers and constructing networks of layers (e.g. using `max.nn.Sequential`). Currently, this class supports graph building by ensuring that all weight names are unique and systematically generated. This class also supports managing the weight values with the `module.state_dict()` and `module.load_state_dict()` methods.
  More functionality and documentation will be added in future releases.

### Custom ops {#25-2-custom-ops}

* Changes have been made to the way custom ops are registered: rather than using the `num_dps_outputs` attribute on `@compiler.register` to specify the number of outputs, that number is now inferred from the signature of the custom operation. Inputs to the operation now use the `InputTensor` type and outputs use `OutputTensor`, instead of the previous `ManagedTensorSlice` for both. This eliminates the need for a manual `num_dps_outputs` attribute and makes it safer to work with these inputs and outputs by preventing accidental writes to input tensors. The new interface looks something like the following:

  ```mojo
  @compiler.register("add_one_custom")
  struct AddOneCustom:
      @staticmethod
      fn execute[
          target: StringLiteral,
      ](
          out: OutputTensor,
          x: InputTensor[type = out.type, rank = out.rank],
          ctx: DeviceContextPtr,
      ) raises:
          @parameter
          @always_inline
          fn elementwise_add_one[
              width: Int
          ](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
              return x.load[width](idx) + 1

          foreach[elementwise_add_one, target=target](out, ctx)
  ```

* The `foreach` function is now declared as `raises`, so it can handle errors within an elementwise calculation.

### Hopper kernels {#25-2-hopper-kernels}

State-of-the-art kernels in Mojo for H100/H200 GPUs:

* **Hopper architecture matrix multiplication kernels**: The implementation achieved performance comparable to NVIDIA's highly optimized cuBLAS library. These kernels take full advantage of the Tensor Cores in Hopper architecture GPUs to accelerate the fundamental matrix multiplication operations that underpin deep learning workloads.
* **Multi-GPU AllReduce implementation**: The AllReduce operation is critical for distributed inference across multiple GPUs, as it efficiently aggregates partial results across devices. The Mojo implementation surpassed NVIDIA's NCCL library in performance benchmarks. This improvement reduces communication overhead during distributed inference.
* **MAX attention kernel with Flash Attention 3**: This implementation incorporates and extends the latest Flash Attention 3 algorithm, which significantly accelerates the computation of attention mechanisms in transformer models. The MAX attention kernel optimizes memory access patterns and computational steps, reducing both the memory footprint and execution time of attention operations. This is particularly important for LLMs, where attention calculations represent a substantial portion of the computational workload.

### GPU programming {#25-2-gpu-programming}

* Added the [Mojo `max.driver` API](/max/api/mojo/driver) to enable dispatching GPU functions from Mojo. Check out [examples for GPU programming in Mojo](https://github.com/modular/modular/tree/main/examples/gpu_functions), which use this new API.

### Mojo {#25-2-mojo}

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the [Mojo changelog](/mojo/changelog).

### Documentation {#25-2-documentation}

New examples for writing custom ops:

* [`fused_attention`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/fused_attention.mojo) demonstrates complex GPU programming using MAX abstractions for a practical use in AI model development.
* [`matrix_multiplication`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/matrix_multiplication.mojo) includes a series of progressive optimizations for matrix multiplication on GPUs.
* [`histogram`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/histogram.mojo) shows how to implement the histogram pattern as a custom op.
* New [examples for GPU programming in Mojo](https://github.com/modular/modular/tree/main/examples/gpu_functions) using the new [MAX Driver API](/max/api/mojo/driver/). These use a Mojo programming model that should look familiar to CUDA C programmers, showing how to define and dispatch GPU functions within a single Mojo file. These examples recreate the first three samples from the popular textbook ["Programming Massively Parallel Processors"](https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311), showing how basic concepts translate from CUDA into Mojo. There's also a Mandelbrot set calculation example that parallels a similar one in the existing custom ops examples.
* New [MAX containers](/max/container/) are available. For more information on the base and full MAX containers, see [Container contents](/max/container/#container-contents).

## v25.1.1 (2025-02-19)

Fixed performance issues in autoregressive models with paged attention by setting sensible, platform-specific default values for `--max-num-steps`.

## v25.1 (2025-02-13)

* [Highlights](#25-1-highlights)
* [Documentation](#25-1-docs)
* [MAX Serve](#25-1-serve)
* [MAX models](#25-1-max-models)
* [MAX Engine](#25-1-engine)
* [Graph APIs](#25-1-graph)
* [Pipeline APIs](#25-1-pipelines)
* [GPU programming](#25-1-gpus)
* [Mojo](#25-1-mojo)

### ✨ Highlights {#25-1-highlights}

* **Custom ops for GPUs**

  Our new custom op API allows you to extend MAX Engine with new graph operations written in Mojo that execute on either CPU or GPU, providing full composability and extensibility for your models. See more in the section about [GPU programming](#25-1-gpus).

* **Enhanced support for agentic workflows**

  MAX Serve now supports function calling, which allows you to instruct your model to interact with other systems, such as retrieving data and executing external tasks. [Learn more about function calling and tool use](/max/serve/function-calling).

  MAX Serve now supports structured output (also known as constrained decoding) for MAX models on GPU. This allows you to enforce the output format from a model using an input schema that defines the output structure (see the example sketch at the end of these highlights). [Learn more about structured output](/max/serve/structured-output).

* **Extended model architecture support**

  * MAX Serve now supports multimodal models that take both text and image inputs. For example, see [how to deploy Llama 3.2 Vision](/max/tutorials/deploy-llama-vision).
  * MAX Serve now supports text embedding models. Learn how to [deploy a text embedding model](/max/tutorials/run-embeddings-with-max-serve).

* **New `max-pipelines` CLI tool**

  Instead of cloning our GitHub repo to access our latest GenAI models, you can install the `max-pipelines` CLI tool and quickly run an inference or deploy an endpoint. Learn more in the [`max-pipelines` docs](/max/max-pipelines).
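To make the structured output highlight above concrete, here's a hypothetical client-side sketch using the OpenAI Python client against a MAX Serve endpoint. The schema, model name, and port are illustrative assumptions, and the server must be started with `--enable-structured-output` (described under MAX Serve below):

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

# Illustrative JSON schema; the server enforces this output format
# via constrained decoding.
completion = client.chat.completions.create(
    model="modularai/Llama-3.1-8B-Instruct-GGUF",  # illustrative model
    messages=[{"role": "user", "content": "Extract: Jane is 34."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    },
)
print(completion.choices[0].message.content)
```

Because decoding is constrained server-side, the response content should parse as JSON matching the schema rather than free-form text.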
### Documentation {#25-1-docs}

New tutorials:

* [Build custom ops for GPUs](/max/tutorials/build-custom-ops)
* [Serverless GPU inference on Google Cloud Run](/max/tutorials/deploy-serverless-cloud-run)
* [Generate image descriptions with Llama 3.2 Vision](/max/tutorials/deploy-llama-vision)
* [Deploy a text embedding model](/max/tutorials/run-embeddings-with-max-serve)

Other docs:

* [Function calling and tool use](/max/serve/function-calling)
* [Structured output](/max/serve/structured-output)
* [Prefix caching with PagedAttention](/max/serve/prefix-caching)
* [max-pipelines](/max/max-pipelines)

### MAX Serve {#25-1-serve}

* The `/v1/completions` REST endpoint now supports:
  * Pre-tokenized prompts.
  * Image inputs for multimodal models such as `Llama-3.2-11B-Vision-Instruct`. For an example, see [how to generate image descriptions with Llama 3.2 Vision](/max/tutorials/deploy-llama-vision).

    **Known issue:** You might receive faulty results because some parts of the text prompt get ignored for certain input combinations. We've identified the problem and will have a fix in a subsequent [nightly release](/max/packages/#nightly-release).

* Function calling and tool use, which allows you to instruct your model to interact with other systems, such as retrieving data and executing external tasks. [Learn more about function calling and tool use](/max/serve/function-calling).
* Structured output (also known as constrained decoding), which allows you to enforce the output format from a model using a JSON schema and the `response_format` field. To enable constrained decoding, pass `--enable-structured-output` when running the server. However, this feature currently works for MAX models on GPU only (support for PyTorch models and CPU is in progress). [Learn more about structured output](/max/serve/structured-output).
* Added support for the `/v1/embeddings` API endpoint, allowing you to generate vector representations using embedding models. See how to [deploy a text embedding model](/max/tutorials/run-embeddings-with-max-serve).
* MAX Serve can now evict requests when the number of available pages in the PagedAttention KVCache is limited. Previously, the KV manager would throw an OOM error when a batch that couldn't fit in the cache was scheduled.

### MAX models {#25-1-max-models}

* Added the [`max-pipelines`](/max/max-pipelines) CLI tool, which simplifies running inference with GenAI models (specified with a Hugging Face repo ID) and deploying them to a local endpoint with MAX Serve.

  Previously, running or serving these models required cloning the [modular/max](https://github.com/modular/max) GitHub repo and then running commands such as `magic run llama3`. Model-specific commands like `llama3` and `replit` have been removed; they're now standardized and subsumed by flags like `--model-path` in the `max-pipelines` tool. Arguments such as `--max-length` and `--weight-path` are also still supported by `max-pipelines`.

  To view a list of supported model architectures from Hugging Face, run `max-pipelines list`.

* Added support for PagedAttention, which improves memory efficiency by partitioning the KV cache into smaller blocks, reducing fragmentation and enabling larger inference batches. You can enable it with `--cache-strategy=paged` and `--kv-cache-page-size` with a value that's a multiple of 128.
* Added support for prefix caching in all cases where PagedAttention is supported. This allows for more efficient usage of the KVCache and improved prefill performance for workloads with common prefixes.
  You can enable it by setting `--enable-prefix-caching`. For more information, see [Prefix caching with PagedAttention](/max/serve/prefix-caching).

* Batch size and max length are now inferred from available memory and the Hugging Face model's default values for max length, respectively. If a configuration leads to an OOM, then we provide recommendations (to the best of our ability) to help the user fit the model into memory.
* Added support for heterogeneous KV caches for multimodal models, such as Llama Vision, which cache different KV states for self- and cross-attention layers.
* Added support for embedding models, starting with MPNet. For example:

  ```shell
  max-pipelines generate \
    --model-path=sentence-transformers/all-mpnet-base-v2 \
    --prompt="Encode this sentence."
  ```

  Also see [how to deploy a text embedding model](/max/tutorials/run-embeddings-with-max-serve).

* Added support for image and text multimodal models:
  * `max-pipelines generate` now accepts image input with `--image_url`.
  * Added an experimental Pixtral pipeline you can run as follows:

    ```shell
    max-pipelines generate \
      --model-path=mistral-community/pixtral-12b \
      --prompt="What is in this image? [IMG]" \
      --image_url=/images/artwork/max-serve-cloud.png
    ```

    The pipeline is automatically used for all models implementing the `LlavaForConditionalGeneration` architecture. The implementation currently has a limit of one image. We plan to support an arbitrary number of images of mixed sizes soon.

  * Added an experimental Llama Vision pipeline you can run as follows:

    ```shell
    max-pipelines generate \
      --model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \
      --prompt="What is in this image?" \
      --image_url=/images/artwork/max-serve-cloud.png
    ```

    The pipeline is automatically used for all models implementing the `MllamaForConditionalGeneration` architecture.

    Note: This model is gated and requires that you set the [`HF_TOKEN`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hftoken) environment variable. See [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).

  * See [how to generate image descriptions with Llama 3.2 Vision](/max/tutorials/deploy-llama-vision).

* Added support for the `Qwen2ForCausalLM` model architecture (such as `Qwen/Qwen2.5-7B-Instruct`). For example:

  ```shell
  max-pipelines generate \
    --model-path=Qwen/Qwen2.5-7B-Instruct \
    --prompt="Write bubble sort in python" \
    --quantization-encoding bfloat16
  ```

* Added support for offline batched inference for text-based LLMs, allowing you to load a model and run inference with a batch of inputs directly from Python, instead of relying on an HTTP interface. For an example, see [`examples/offline-inference/basic.py`](https://github.com/modular/modular/blob/main/examples/offline-inference/basic.py).
* The `--max-cache-batch-size` flag has been deprecated in favor of `--max-batch-size`. Using `--max-cache-batch-size` now emits a deprecation warning and will stop working in a future release.
* The `--use-gpu` flag has been deprecated in favor of `--devices=cpu`, `--devices=gpu`, or `--devices=gpu-0,gpu-1,...`. If the device isn't specified, the model runs on the first available GPU, or on the CPU if no GPUs are available.

### MAX Engine {#25-1-engine}

* Improved internal kernel compilation speed by 1.5-4x across different models. We've revamped our GPU compilation process so that all kernels in a program are compiled together into a single LLVM module, then split into separate kernels afterward.
  This ensures shared code between kernel entry points is only compiled once. For example, we observe a 3.7x speedup in GPU startup time for Llama 3.1-8B.

* Improved initial model execution speed on NVIDIA GPUs. Instead of compiling to PTX and performing just-in-time compilation during runtime, we now generate CUBIN binaries directly. While this increases initial compilation time, it significantly improves execution speed.
* The kernels have been further tuned for performance on NVIDIA A100 GPUs.

#### Graph APIs {#25-1-graph}

* You can now write custom operations (ops) in Mojo and add them to a graph constructed in Python, using [`custom()`](/max/api/python/graph/ops#max.graph.ops.custom) and [`inplace_custom()`](/max/api/python/graph/ops#max.graph.ops.inplace_custom). For more detail, see the section below about [GPU programming](#25-1-gpus).
* Cached compiled MAX graphs that make use of custom operations now get invalidated when the implementation of the custom operations changes.
* [`Graph.add_weight()`](/max/api/python/graph/Graph#max.graph.Graph.add_weight) now takes an explicit `device` argument. This enables explicitly passing GPU-resident weights to [`session.load()`](/max/api/python/engine#max.engine.InferenceSession.load) via the weights registry to initialize the model.
* [`max.graph.Weight`](/max/api/python/graph/Weight) now inherits from `TensorValue`, allowing you to call `weight.cast()` or `weight.T`. As such, [`TensorValue`](/max/api/python/graph/TensorValue#max.graph.TensorValue) no longer accepts `Weight` for the `value` argument.

#### Pipeline APIs {#25-1-pipelines}

* [`TextTokenizer.new_context()`](/max/api/python/pipelines/tokenizer#max.pipelines.tokenizer.TextTokenizer.new_context) now supports tool definitions passed through its `request` argument (via `TokenGeneratorRequest.tools`). It also now supports JSON schemas passed through its `request` argument (via [`TokenGeneratorRequest.response_format`](/max/api/python/pipelines/interfaces/#max.pipelines.interfaces.TokenGeneratorRequest.response_format)).
* Removed the default `num_steps` value for [`TokenGenerator.next_token()`](/max/api/python/pipelines/interfaces/#max.pipelines.interfaces.TokenGenerator.next_token), ensuring users pass a value and reducing the potential for silent errors.
* [`KVCacheStrategy`](/max/api/python/pipelines/kv_cache/cache_params#max.pipelines.kv_cache.cache_params.KVCacheStrategy) now defaults to `MODEL_DEFAULT`. Instead of always using the "continuous" caching strategy as before, the KV caching strategy is now defaulted on an architecture-specific basis to ensure the most optimized caching strategy is used.
* The [`Linear`](/max/api/python/nn/linear#max.nn.linear.Linear) layer now has a `create()` class method that automatically creates specializations of `Linear` for non-quantized, k-quant, or GPTQ layers.
* Added [`nn.Conv1D`](/max/api/python/nn/conv#max.nn.conv.Conv1D) for audio models like Whisper.

#### GPU programming {#25-1-gpus}

This release includes all-new APIs for programming GPUs. The way to write code for GPUs is to create custom operations with GPU functions that you can load into a MAX graph. This foundational API includes a few key components:

* Mojo APIs to write custom op functions:
  * The [`@compiler.register`](/max/api/mojo-decorators/compiler-register) decorator is applied to a Mojo struct that implements a custom op in an `execute()` function—for either CPU or GPU—and a `shape()` function that defines the custom op's output tensor.
  * The [`max.tensor`](/max/api/mojo/tensor/) package adds essential Mojo APIs for writing custom ops, such as:
    * The [`foreach()`](/max/api/mojo/tensor/managed_tensor_slice/foreach) function, which efficiently executes an element-wise computation in parallel on either a GPU or CPU.
    * The [`ManagedTensorSlice`](/max/api/mojo/tensor/managed_tensor_slice/ManagedTensorSlice) type, which defines the input and output tensors for the custom op.
* Python APIs to load custom ops into a model:
  * The [`custom()`](/max/api/python/graph/ops#max.graph.ops.custom) and [`inplace_custom()`](/max/api/python/graph/ops#max.graph.ops.inplace_custom) functions allow you to add the previously defined Mojo custom op to a MAX graph written in Python.
  * The [`InferenceSession`](/max/api/python/engine#max.engine.InferenceSession) constructor accepts the custom op implementation as a [Mojo package](/mojo/manual/packages#mojo-packages) in the `custom_extensions` argument.

For more detail, see the [tutorial to build custom ops for GPUs](/max/tutorials/build-custom-ops), or check out this [simple example of a custom op](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/add_custom.mojo).

Additionally, we've added a new [`gpu` package](/mojo/stdlib/gpu/) to the Mojo standard library that provides low-level programming constructs for working with GPUs. These APIs let you do things that you can't currently do with the high-level `foreach()` abstraction above. The Mojo `gpu` APIs allow you to manually manage interaction between the CPU host and GPU device, manage memory between devices, synchronize threads, and more. For some examples, see [`vector_addition.mojo`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/vector_addition.mojo) and [`top_k.mojo`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/top_k.mojo).

### Mojo {#25-1-mojo}

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the [Mojo changelog](/mojo/changelog).

## v24.6 (2024-12-17)

This is a huge update that offers a first look at our serving library for MAX on GPUs!

* [Highlights](#24-6-highlights)
* [Documentation](#24-6-docs)
* [MAX Serve](#24-6-serve)
* [MAX models](#24-6-models)
* [MAX Engine](#24-6-engine)
* [Driver APIs](#24-6-driver-api)
* [Graph compiler](#24-6-graph-compiler)
* [Graph APIs](#24-6-graph-api)
* [Custom op registration](#24-6-custom-ops)
* [Numeric kernels](#24-6-kernels)
* [Mojo](#24-6-mojo)

Also check out our [blog post introducing MAX 24.6](https://www.modular.com/blog/introducing-max-24-6-a-gpu-native-generative-ai-platform).

### ✨ Highlights {#24-6-highlights}

* **MAX Engine on GPUs preview**

  We're excited to share a preview of MAX Engine on GPUs. We've created a few tutorials that demonstrate MAX's ability to run GenAI models with our next-generation MAX graph compiler on NVIDIA GPU architectures (including A100, A10, L4, and L40 GPUs). You can experience it today by [deploying Llama 3 on an A100 GPU](/max/tutorials/max-serve-local-to-cloud).

* **MAX Serve preview**

  This release also includes an all-new serving interface called MAX Serve. It's a Python-based serving layer that supports both native MAX models when you want a high-performance deployment, and off-the-shelf PyTorch LLMs from Hugging Face when you want to explore and experiment—all with GPU support.
It provides an OpenAI-compatible REST endpoint for inference requests, and a Prometheus-compatible metrics endpoint. You can use a `magic` command to start a local server, or use our ready-to-deploy MAX container to start an endpoint in the cloud. Try it now [with an LLM from Hugging Face](/max/tutorials/deploy-pytorch-llm).
* **Upgraded MAX models** As we continue to build our Python-based MAX Graph API that allows you to build high-performance GenAI models, we've made a ton of performance improvements to the existing models and added a few new models to our GitHub repo. All the Python-based MAX models now support GPUs and broad model architectures. For example, [`llama3`](https://github.com/modular/modular/tree/main/max/pipelines/architectures/llama3) adds compatibility for the LlamaForCausalLM family, which includes over 20,000 model variants and weights on Hugging Face.

### Documentation {#24-6-docs}

New tutorials:

* [Deploy Llama 3 on GPU with MAX Serve](/max/tutorials/max-serve-local-to-cloud)
* [Deploy a PyTorch model from Hugging Face](/max/tutorials/deploy-pytorch-llm)
* [Deploy Llama 3.1 on GPU-powered Kubernetes clusters](/max/tutorials/deploy-max-serve-on-kubernetes)
* [Get started with MAX Graph in Python](/max/tutorials/get-started-with-max-graph-in-python)

Other new docs:

* [MAX container](/max/container)
* [Benchmark MAX Serve](https://github.com/modular/modular/tree/main/benchmark)

Also, our documentation is now available for **MAX nightly builds**! If you're building with a [MAX nightly release](/max/packages#nightly-release), you can switch to see the nightly docs using a toggle to the right of the search bar.

### MAX Serve {#24-6-serve}

This release includes a preview of our Python-based serving library called MAX Serve. It simplifies the process of deploying your own inference server with consistent and reliable performance. MAX Serve currently includes the following features:

* Deploys locally and to the cloud with our [MAX container image](/max/container), or with the `magic` CLI.
* An OpenAI-compatible server with streaming `/chat/completion` and `/completion` endpoints for LLM inference requests (see the example request at the end of this section).
* Prometheus-compatible [metrics endpoint](/max/container#metrics) with LLM KPIs such as time to first token (TTFT) and inter-token latency (ITL), for monitoring and evaluating performance.
* Supports most `TextGeneration` Hugging Face Hub models.
* Multiprocess HTTP/model worker architecture to maximize CPU core utilization by distributing multiple incoming requests across multiple processes, ensuring both high throughput and responsiveness.
* Continuous heterogeneous batching to combine multiple incoming requests into a single inference (no waiting to fill a batch size) and improve total throughput.

There's much more still in the works for MAX Serve, but you can try it today with our tutorials to [Deploy Llama 3 on GPU with MAX Serve](/max/tutorials/max-serve-local-to-cloud) and [Deploy a PyTorch model from Hugging Face](/max/tutorials/deploy-pytorch-llm).

**Known issues:**

* While this release is enough to support typical chatbot applications, it does not yet support the function-calling portion of the OpenAI API specification needed to enable robust agentic workflows.
* Sampling is still limited and doesn't currently respect temperature or other sampling-related API request inputs.
* Structured generation is not supported.
* Support for multi-modal models is still nascent.
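Because the server is OpenAI-compatible, standard OpenAI clients can talk to it. Here's a minimal, hypothetical sketch using the `openai` Python package; the port, model name, and API key value are illustrative assumptions, not values from these docs:

```python
from openai import OpenAI

# Assumptions for illustration: MAX Serve listening on localhost port 8000
# and serving the Llama 3.1 model named below; adjust for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of Mongolia?"}],
)
print(response.choices[0].message.content)
```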
### MAX models {#24-6-models}

All of our Python-based GenAI [models on GitHub](https://github.com/modular/modular/tree/main/max/pipelines/architectures) now support GPUs! As we add more models, we're also building a robust set of libraries and infrastructure that make it easier to build and deploy a growing library of LLMs. Some of this is available in a new [`max.pipelines`](/max/api/python/pipelines/) package, and some of it lives alongside the [models on GitHub](https://github.com/modular/modular/tree/main/max/pipelines/architectures). Here are just some of the highlights:

* Deep integration with the Hugging Face ecosystem for a quick-to-deploy experience, such as using HF Model Hub tools to fetch config files, support for weights in [safetensors](https://github.com/huggingface/safetensors) format, support for HF tokenizers, and more. (We also support GGUF weight formats.)
* Expanded set of model abstractions for use by different LLM architectures:
  * Attention layers (including highly optimized implementations with configurable masking, like [`AttentionWithRope`](https://github.com/modular/modular/tree/main/max/nn/attention/attention_with_rope.py)). The optimized attention layers include variants that accept an attention mask. More memory-efficient variants don't take a mask; instead, they take a "mask functor" argument to the kernel, which implements masking without materializing a mask by computing mask values from input coordinates on the fly.
  * Transformers such as [`Transformer` and `TransformerBlock`](https://github.com/modular/modular/tree/main/max/nn/transformer/transformer.py). These include an initial implementation of ragged tensors—tensors in which each sequence can have a different length, avoiding the use of padding tokens by flattening a batch of sequences of differing lengths.
  * Common layers such as [`RMSNorm`](https://github.com/modular/modular/tree/main/max/nn/norm/rms_norm.py), [`Embedding`](https://github.com/modular/modular/tree/main/max/nn/embedding.py), and [`Sequential`](https://github.com/modular/modular/tree/main/max/nn/sequential.py).
  * KV cache management helpers, like [`ContinuousBatchingKVCacheManager`](/max/api/python/pipelines/kv_cache/continuous_batching_cache#max.pipelines.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheManager).
  * Low-level wrappers over optimized kernels like [`fused_qk_ragged_rope`](https://github.com/modular/modular/tree/main/max/nn/kernels.py). These are custom fused kernels that update the KV cache in place. Although they are custom, they reuse the underlying kernel implementation by passing in lambda functions used to retrieve inputs and write to outputs in place.
* Added generalized interfaces for text generation such as [`TokenGenerator`](/max/api/python/pipelines/interfaces#max.pipelines.interfaces.TokenGenerator) and [`PipelineModel`](/max/api/python/pipelines/pipeline#max.pipelines.pipeline.PipelineModel), which provide modularity within the models and serving infrastructure. Also added a plug-in mechanism ([`PipelineRegistry`](/max/api/python/pipelines/registry#max.pipelines.registry.PipelineRegistry)) to more quickly define new models, tokenizers, and other reusable components. For example, anything that conforms to [`TokenGenerator`](/max/api/python/pipelines/interfaces#max.pipelines.interfaces.TokenGenerator) can be served using the LLM infrastructure within MAX Serve.
We then used this interface to create the following:

* An optimized [`TextGenerationPipeline`](/max/api/python/pipelines/pipeline#max.pipelines.pipeline.TextGenerationPipeline) that can be combined with any compatible graph and has powerful performance features like graph-based multi-step scheduling, sampling, KV cache management, ragged tensor support, and more.
* A generic [`HFTextGenerationPipeline`](/max/api/python/pipelines/hf_pipeline#max.pipelines.hf_pipeline.HFTextGenerationPipeline) that can run, in eager mode, any Hugging Face model for which we don't yet have an optimized implementation.
* Models now accept weights via a weights registry, which is passed to the [`session.load()`](/max/api/python/engine#max.engine.InferenceSession.load) method's `weights_registry` argument (a minimal sketch appears in the MAX Engine section below). The decoupling of weights and model architecture allows implementing all of the different fine-tunes for a given model with the same graph. Furthermore, because the underlying design is decoupled, we can later expose the ability to compile a model once and swap weights out on the fly, without re-compiling the model.
* Added generic implementations of common kernels, which allow you to plug in different batching strategies (ragged or padded), KV cache management approaches (continuous batching), masking (causal, sliding window, etc.), and position encoding (RoPE or ALiBi) without having to rewrite any kernel code. (More about this in a future release.)
* Multi-step scheduling to run multiple token-generation steps on GPU before synchronizing to the CPU.

**Updated models:**

* Significant performance upgrades for [Llama 3](https://github.com/modular/modular/tree/main/max/pipelines/architectures/llama3), and expanded compatibility with the `LlamaForCausalLM` model family. For example, it also supports Llama 3.2 1B and 3B text models.

**New models:**

* [Mistral NeMo](https://github.com/modular/modular/tree/main/max/pipelines/architectures/mistral) (and other `MistralForCausalLM` models)
* [Replit Code V1.5 3B](https://github.com/modular/modular/tree/main/max/pipelines/architectures/replit)

**Known issues:**

* The Q4 quantized models currently work on CPU only.
* Using a large setting for `top-k` with the Llama 3.1 model may lead to segmentation faults for certain workloads when run on NVIDIA GPUs. This should be resolved in the latest nightly MAX builds.
* The models currently use a smaller default context window than the `max_seq_len` specified in the Hugging Face configuration files for a given model. This can be manually adjusted by setting the `--max-length` parameter to the desired context length when serving a model.
* Some variants of the supported core models (like `LlamaForCausalLM` with a different number of heads, head sizes, etc.) might not be fully optimized yet. We plan to fully generalize our implementations in a future release.

### MAX Engine {#24-6-engine}

MAX Engine includes a lot of the core infrastructure that enables MAX to accelerate AI models on any hardware, such as the graph compiler, runtime, kernels, and the APIs to interact with it all. It works without external dependencies such as PyTorch or CUDA.

This release includes a bunch of performance upgrades to our graph compiler and runtime. We've added support for NVIDIA GPU architectures (including A100, A10, L4, and L40 GPUs), and built out new infrastructure so we can quickly add support for other GPU hardware.
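Before diving into the API changes, here's a rough sketch of the weights-registry flow described in the models section above. The weight names and files are hypothetical stand-ins, and `graph` is assumed to be a MAX graph whose weights were staged with `Graph.add_weight()`:

```python
import numpy as np
from max.engine import InferenceSession

# Hypothetical sketch: `graph` was built elsewhere, with named weight
# placeholders (e.g., "layers.0.wq") staged via Graph.add_weight().
session = InferenceSession()

# The registry maps weight names to arrays (NumPy works via DLPack).
# Swapping in another fine-tune's arrays reuses the same graph, with no
# changes to the model architecture code and no recompilation of the graph.
weights = {"layers.0.wq": np.load("finetune_a/wq.npy")}
model = session.load(graph, weights_registry=weights)
```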
**Engine API changes:**

* [`InferenceSession`](/max/api/python/engine#max.engine.InferenceSession) now accepts a `custom_extensions` constructor argument, just as `load()` does, to specify model extension libraries.
* The [`Model`](/max/api/python/engine#max.engine.Model) object is now callable to run an inference.

**Breaking changes**:

* `Model.execute()` signature changed to support GPUs.
* The [`execute()`](/max/api/python/engine#max.engine.Model.execute) function currently doesn't accept keyword arguments. Instead you can pass tensors as a [`driver.Tensor`](/max/api/python/driver#max.driver.Tensor), `int`, `float`, `bool`, [`np.generic`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.generic), or [`DLPackArray`](/max/api/python/driver#max.driver.DLPackArray) ([DLPack](https://github.com/dmlc/dlpack)). Note that both PyTorch and NumPy arrays implement the DLPack protocol, which means you can also pass either of those types to `execute()`.
* [`execute_legacy()`](/max/api/python/engine#max.engine.Model.execute_legacy) preserves the previous `execute()` semantics, including keyword-argument support, to help with migration, but it will be removed in a future release. `execute_legacy()` doesn't support GPUs.
* Calling `execute()` with positional arguments still works the same.

#### Driver APIs {#24-6-driver-api}

MAX Driver (the [`max.driver`](/max/api/python/driver) module) is a new component of MAX Engine that's still a work in progress. It provides primitives for working with heterogeneous hardware systems (GPUs and CPUs), such as allocating on-device memory, transferring data between host and device, querying device stats, and more. It's a foundation on which other components of MAX Engine operate (for example, `InferenceSession` now uses [`driver.Tensor`](/max/api/python/driver#max.driver.Tensor) to handle model inputs and outputs).

**Driver API changes:**

* Added `CUDA()` device to open an NVIDIA GPU.
* Added support for fp16 and bfloat16 dtypes.
* Expanded functionality for `max.driver.Device`, with new class methods and properties. We are still working on building this out to support more accelerator features.
* [`driver.Tensor`](/max/api/python/driver#max.driver.Tensor) (and the `InferenceSession.load()` argument `weights_registry`) now supports zero-copy interoperability with NumPy arrays and PyTorch tensors, using [DLPack](https://github.com/dmlc/dlpack) / [`DLPackArray`](/max/api/python/driver#max.driver.DLPackArray).
* [`driver.Tensor`](/max/api/python/driver#max.driver.Tensor) has new methods, such as `from_dlpack()`, `element_size()`, `to()`, `to_numpy()`, `view()`, `zeros()`, and more.

MAX Driver APIs are still changing rapidly and are not yet ready for general use. We'll publish more documentation in a future release.

**Known issues:**

* MAX Driver is currently limited to managing just one NVIDIA GPU at a time (it does not yet support multi-GPU). It also does not yet support remote devices.
* DLPack support is not complete. For example, streams are not yet supported.

#### Graph compiler {#24-6-graph-compiler}

When you load a model into MAX Engine, the graph compiler is the component that inspects and optimizes all graph operations (ops) to deliver the best run-time performance on each device. This release includes various graph compiler improvements:

* Major extensions to support NVIDIA GPUs (and other devices in the future), including async copies and caching of JIT-compiled kernels.
* The runtime now performs scheduling to enable GPU compute overlap with the CPU.
* New transformations to the Mojo kernels enable a number of optimizations, including specialization on tensor dimensions, specialization on target hardware, specialization on non-tensor-dimension inputs to kernels, automatic kernel fusion between operators, and more.
* New algebraic simplifications and algorithms for ops, such as horizontal fusion of matrix multiplications.
* New CPU-side primitives for device management that are automatically transformed and optimized to reduce overhead (MAX does not need to use things like CUDA Graphs).
* Updated memory planning to preallocate device memory (hoisting computation from inference run time to initialization time) and reduce per-inference overhead.

#### Graph APIs {#24-6-graph-api}

The graph compiler is also exposed through the MAX Graph APIs (the [`max.graph`](/max/api/python/graph/) package), which allow you to build high-performance GenAI models in Python.

**Graph API changes:**

* Python stack traces from model execution failures now include a trace to the original op creation, allowing for easier debugging during development.
* The [`max.graph`](/max/api/python/graph/) APIs now include preliminary support for symbolic algebraic expressions using [`AlgebraicDim`](/max/api/python/graph/type#max.graph.type.AlgebraicDim), enabling more powerful support for checked dynamic shapes. This allows expressions like `-Dim("x") - 4`. Furthermore, the algebraic expressions simplify to a canonical form, so that, for example, `-Dim("x") - 4 == -(Dim("x") + 4)` holds.
* More advanced dtype promotion now allows [`TensorValue`](/max/api/python/graph/TensorValue) math operators to just work when used with NumPy arrays and Python primitives.
* [`TensorValue`](/max/api/python/graph/TensorValue) has new methods, such as `broadcast_to()`, `cast()`, `flatten()`, `permute()`, and more.
* Added [`BufferValue`](/max/api/python/graph/BufferValue), which allows for device-resident tensors that are read and mutated within the graph.
* [`DType`](/max/api/python/dtype#max.dtype.DType) has new methods and properties: `align`, `size_in_bytes`, and `is_float()`.
* [`Value`](/max/api/python/graph/Value) constructor accepts more types for `value`.
* [`TensorValue`](/max/api/python/graph/TensorValue) constructor accepts more types for `value`.
* [`TensorValue.rebind()`](/max/api/python/graph/TensorValue#max.graph.TensorValue.rebind) accepts a new `message` argument.

**Breaking changes:**

* [`Graph.add_weight()`](/max/api/python/graph/Graph#max.graph.Graph.add_weight) now accepts [`Weight`](/max/api/python/graph/Weight#max.graph.Weight) and returns [`TensorValue`](/max/api/python/graph/TensorValue). [`Weight`](/max/api/python/graph/Weight#max.graph.Weight) is essentially a named placeholder for a tensor that knows its name, dtype, shape, and optionally device and quantization encoding. `Graph.add_weight()` stages an op in the graph that is populated by a named weight in the weights registry passed to `session.load()`.
* The [`Weight`](/max/api/python/graph/Weight#max.graph.Weight) constructor arguments changed: added `align`, `dtype`, and `shape`; removed `assign`, `filepath`, `offset`, and `value`.
* The `ops.scalar()` method was removed, along with the `is_static()` and `is_symbolic()` methods from all `graph.type` objects (see the sketch below):
  * Instead of `ops.scalar()`, use [`ops.constant()`](/max/api/python/graph/ops#max.graph.ops.constant).
  * Instead of `is_static()` and `is_symbolic()`, use `isinstance(dim, SymbolicDim)` and `isinstance(dim, StaticDim)`.
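A minimal sketch of the new dimension checks; the import path and the behavior of the `Dim` constructor are assumptions that may vary by release:

```python
from max.graph.type import Dim, StaticDim, SymbolicDim

# Assumption: Dim("x") produces a symbolic dim and Dim(128) a static dim.
for dim in [Dim("batch"), Dim(128)]:
    if isinstance(dim, SymbolicDim):    # previously: dim.is_symbolic()
        print("symbolic dim:", dim)
    elif isinstance(dim, StaticDim):    # previously: dim.is_static()
        print("static dim:", dim)
```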
The MAX Graph APIs are not ready for general use, but you can [experiment with them now by following this tutorial](/max/tutorials/get-started-with-max-graph-in-python). We'll add more documentation when we finish some API redesigns.

#### Custom op registration {#24-6-custom-ops}

Although the APIs to write custom operators (ops) aren't ready for general use, this release includes a significant redesign that lays the groundwork. You might notice some associated APIs in this release and more APIs in the nightlies, so here's a little about the work in progress:

* The custom op APIs will allow you to extend MAX Engine with new ops written in Mojo, providing full composability and extensibility for your models. It's the exact same API we use to write MAX Engine's built-in ops such as `matmul`. That means your custom ops can benefit from all our compiler optimization features such as kernel fusion—your ops are treated the same as all the ops included "in the box."
* The new API requires far less adornment at the definition site to enable the MAX model compiler to optimize custom ops along with the rest of the graph (compared to our previous version that used `NDBuffer`).
* Custom ops support "destination passing style" for tensors.
* The design composes on top of Mojo's powerful metaprogramming, as well as the kernel library's abstractions for composable kernels.

We'll publish more documentation when the custom op API is ready for general use. Check out the MAX repo's `nightly` branch to see the latest [custom op examples](https://github.com/modular/modular/tree/main/examples/custom_ops).

**Known issues:**

* Custom ops don't have type or lifetime checking. They also don't reason about mutability. Expect lots of sharp corners and segfaults if you hold them wrong while we improve this!

#### Numeric kernels {#24-6-kernels}

The GPU kernels for MAX Engine are built from the ground up in Mojo, with no dependencies on external vendor code or libraries. This release includes the following kernel improvements:

* AttenGen: a novel way to express attention patterns that covers different attention masks, score functions, and caching strategies.
* State-of-the-art matrix multiplication algorithms with optimizations such as the following:
  * Pipelining and double-buffering to overlap data transfer and computation and to hide memory access latency (for both global and shared memory).
  * Thread swizzling to avoid shared memory bank conflicts associated with tensor core layouts.
  * Block swizzling to increase L2 cache locality.
* SplitK/StreamK GEMM algorithms: divide the computation along the shared K dimension into smaller matrices that can then be executed independently across streaming multiprocessors (SMs). These algorithms are ideal for matrices with a large K dimension but a small M dimension (see the sketch at the end of this section).
* Large context length MHA: uses SplitK/StreamK to implement the attention mechanism and eliminate the need for a huge score matrix, which drastically reduces memory usage/traffic to enable large context lengths.
* DualGemm: accelerates the multi-layer perceptron (MLP) layers where the left-hand side (LHS) is shared between two matrix multiplications.

**Known issues:**

* The MAX kernels are optimized for bfloat16 on GPUs.
* Convolution on GPU is not performance optimized yet.
* Although v24.6 technically runs on H100, it doesn't include performance-optimized kernels for that device yet, so it isn't recommended.
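To make the SplitK idea concrete, here's a short NumPy sketch (purely illustrative and unrelated to the actual Mojo kernels): the shared K dimension is partitioned, the partial products are computed independently (as the streaming multiprocessors would), and the partials are then reduced.

```python
import numpy as np

def splitk_matmul(a: np.ndarray, b: np.ndarray, splits: int = 4) -> np.ndarray:
    """Illustrative SplitK GEMM: partition the shared K dimension, compute
    partial matmuls independently, then sum the partial results."""
    k = a.shape[1]
    bounds = np.linspace(0, k, splits + 1, dtype=int)
    partials = [a[:, s:e] @ b[s:e, :] for s, e in zip(bounds[:-1], bounds[1:])]
    return np.sum(partials, axis=0)

# SplitK's sweet spot: small M, large K (few output rows, long reduction).
a = np.random.rand(8, 4096)
b = np.random.rand(4096, 16)
assert np.allclose(splitk_matmul(a, b), a @ b)
```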
### Mojo {#24-6-mojo}

Mojo is a crucial component of the MAX stack that enables all of MAX's performance-oriented code across hardware. For all the updates to the Mojo language, standard library, and tools, see the [Mojo changelog](/mojo/changelog#v246-2024-12-17).

## v24.5 (2024-09-13)

### ✨ Highlights

* Mojo and MAX are magical! We've created a new package and virtual environment manager, `magic`, for MAX and Mojo. [Check it out!](/magic/)
* New [Llama 3.1 pipeline](https://github.com/modular/modular/tree/main/max/pipelines/architectures) built with the new MAX Graph Python API.
* We have not one, but two new Python APIs that we're introducing in this release:
  * [MAX Graph Python API](#max-graph-python-api)
  * [MAX Driver Python API](#max-driver-python-api)

### ⭐️ New

* Added `repeat_interleave` graph op.
* Added caching for MAX graph models. This means that graph compilation is cached and the executable model is retrieved from the cache on the second and subsequent runs. Note that the model cache is architecture-specific and isn't portable across different targets.
* Support for Python 3.12.

#### MAX Graph Python API

This Python API will ultimately provide the same low-level programming interface for high-performance inference graphs as the Mojo API. As with the Mojo API, it's an API for graph-building only, and it does not implement support for training. You can take a look at how the API works in the [MAX Graph Python API reference](/max/api/python/graph/).

#### MAX Driver Python API

The MAX Driver API allows you to interact with devices (such as CPUs and GPUs) and allocate memory directly onto them. With this API, you interact with this memory as tensors. Note that this API is still under development, with support for non-host devices, such as GPUs, planned for a future release. To learn more, check out the [MAX Driver Python API reference](/max/api/python/driver).

#### MAX C API

New APIs for adding torch metadata libraries:

* `M_setTorchMetadataLibraryPath`
* `M_setTorchMetadataLibraryPtr`

### 🦋 Changed

#### MAX Engine performance

* Compared to v24.4, MAX Engine v24.5 generates tokens for Llama models an average of 15%-48% faster.

#### MAX C API

Simplified the API for adding torch library paths, which now takes only one path per API call, but can be called multiple times to add paths to the config:

* `M_setTorchLibraries` -> `M_setTorchLibraryPath`

### ⚠️ Deprecated

* The `max` command line tool is no longer supported and will be removed in a future release.

### ❌ Removed

* Dropped support for Ubuntu 20.04. If you're using Ubuntu, we currently support Ubuntu 22.04 LTS only.
* Dropped support for Python 3.8.
* Removed built-in PyTorch libraries from the max package. See the [FAQ](/max/faq) for information on supported torch versions.

## v24.4 (2024-06-07)

### 🔥 Legendary

* MAX is now available on macOS! [Try it now](/max).
* New quantization APIs for MAX Graph. You can now build high-performance graphs in Mojo that use the latest quantization techniques, enabling even faster performance and broader system compatibility for large models. Learn more in the guide to [quantize your graph weights](/max/graph/quantize).

### ⭐️ New

#### MAX Mojo APIs

* Added AI pipeline examples in the `max` repo, with Mojo implementations for common transformer layers, including quantization support.
* New Llama3 pipeline built with MAX Graph.
* New Replit Code pipeline built with MAX Graph.
* New TinyStories pipeline (based on TinyLlama) that offers a simple demo of the MAX Graph quantization API.
* Added the [`max.graph.checkpoint`](/max/api/mojo/graph/checkpoint/) package to save and load model weights. All weights are stored in a [`TensorDict`](/max/api/mojo/graph/checkpoint/tensor_dict/TensorDict). You can save and load a `TensorDict` to disk with the [`save()`](/max/api/mojo/graph/checkpoint/save_load/save) and [`load()`](/max/api/mojo/graph/checkpoint/save_load/load) functions.
* Added MAX Graph quantization APIs:
  * Added quantization encodings [`BFloat16Encoding`](/max/api/mojo/graph/quantization/encodings/BFloat16Encoding), [`Q4_0Encoding`](/max/api/mojo/graph/quantization/encodings/Q4_0Encoding), [`Q4_KEncoding`](/max/api/mojo/graph/quantization/encodings/Q4_KEncoding), and [`Q6_KEncoding`](/max/api/mojo/graph/quantization/encodings/Q6_KEncoding).
  * Added the [`QuantizationEncoding`](/max/api/mojo/graph/quantization/quantization_encoding/QuantizationEncoding) trait so you can build custom quantization encodings.
  * Added [`Graph.quantize()`](/max/api/mojo/graph/graph/Graph#quantize) to create a quantized tensor node.
  * Added [`qmatmul()`](/max/api/mojo/graph/ops/quantized_ops/qmatmul) to perform matrix multiplication between a float32 matrix and a quantized matrix.
* Added some MAX Graph ops:
  * [`avg_pool()`](/max/api/mojo/graph/ops/convolution/avg_pool)
  * [`max_pool()`](/max/api/mojo/graph/ops/convolution/max_pool)
  * [`conv2d()`](/max/api/mojo/graph/ops/convolution/conv2d)
  * [`conv3d()`](/max/api/mojo/graph/ops/convolution/conv3d)
  * [`layer_norm()`](/max/api/mojo/graph/ops/linalg/layer_norm)
  * [`tile()`](/max/api/mojo/graph/ops/linalg/tile)
  * [`select()`](/max/api/mojo/graph/ops/slicing/select)
* Added a [`layer()`](/max/api/mojo/graph/graph/Graph#layer) context manager and [`current_layer()`](/max/api/mojo/graph/graph/Graph#current_layer) function to aid in debugging during graph construction. For example:

```mojo
with graph.layer("foo"):
    with graph.layer("bar"):
        print(graph.current_layer())  # prints "foo.bar"
        x = graph.constant[DType.int64](1)
        graph.output(x)
```

This adds a path `foo.bar` to the added nodes, which will be reported during errors.
* Added the [`format_system_stack()`](/max/api/mojo/graph/error/format_system_stack) function to format the stack trace, which we use to print better error messages from [`error()`](/max/api/mojo/graph/error/error).
* Added [`TensorMap.keys()`](/max/api/mojo/engine/tensor_map/TensorMap#keys) to get all the tensor key names.

#### MAX C API

Miscellaneous new APIs:

* `M_cloneCompileConfig()`
* `M_copyAsyncTensorMap()`
* `M_tensorMapKeys()` and `M_deleteTensorMapKeys()`
* `M_setTorchLibraries()`

### 🦋 Changed

#### MAX Mojo API

* The [`EngineNumpyView.data()`](/max/api/mojo/engine/tensor/EngineNumpyView#unsafe_ptr) and [`EngineTensorView.data()`](/max/api/mojo/engine/tensor/EngineTensorView#unsafe_ptr) functions, which return a type-erased pointer, were renamed to `unsafe_ptr()`.
* [`TensorMap`](/max/api/mojo/engine/tensor_map/TensorMap) now conforms to the `CollectionElement` trait, making it copyable and movable.
* `custom_nv()` was removed, and its functionality moved into [`custom()`](/max/api/mojo/graph/ops/custom_ops/custom) as a function overload, so it can now output a list of tensor symbols.

## v24.3 (2024-05-02)

### 🔥 Legendary

* You can now write custom ops for your models with Mojo! Learn more about [MAX extensibility](/max/custom-ops/).

### 🦋 Changed

* Added support for named dynamic dimensions. This means you can specify when two or more dimensions in your model's input are dynamic but their sizes at run time must match each other.
By specifying each of these dimension sizes with a name (instead of using `None` to indicate a dynamic size), the MAX Engine compiler can perform additional optimizations. See the notes below for the corresponding API changes that support named dimensions.
* Simplified all the APIs to load input specs for models, making them more consistent.

#### MAX Engine performance

* Compared to v24.2, MAX Engine v24.3 shows an average speedup of 10% on PyTorch models, and an average 20% speedup on dynamically quantized ONNX transformers.

#### MAX Graph API

The [`max.graph`](/max/api/mojo/graph/) APIs are still changing rapidly, but are starting to stabilize.

* `AnyMoType` renamed to [`Type`](/max/api/mojo/graph/type/Type), `MOTensor` renamed to [`TensorType`](/max/api/mojo/graph/type/TensorType), and `MOList` renamed to [`ListType`](/max/api/mojo/graph/type/ListType).
* Removed `ElementType` in favor of using `DType`.
* Removed `TypeTuple` in favor of using `List[Type]`.
* Removed the `Module` type, so you can now start building a graph by directly instantiating a [`Graph`](/max/api/mojo/graph/graph/Graph).
* Some new ops in [`max.graph.ops`](/max/api/mojo/graph/ops/), including support for custom ops. See how to [create a custom op in MAX Graph](/max/extensibility/).

#### MAX Engine Python API

* Redesigned [`InferenceSession.load()`](/max/api/python/engine#max.engine.InferenceSession.load) to replace the confusing `options` argument with a `custom_ops_path` argument. As a result, `CommonLoadOptions`, `TorchLoadOptions`, and `TensorFlowLoadOptions` have all been removed.
* [`TorchInputSpec`](/max/api/python/engine#max.engine.TorchInputSpec) now supports named dynamic dimensions (previously, dynamic dimension sizes could be specified only as `None`). This lets you tell MAX which dynamic dimensions are required to have the same size, which helps MAX better optimize your model.

#### MAX Engine Mojo API

* `InferenceSession.load_model()` was renamed to [`load()`](/max/api/mojo/engine/session/InferenceSession#load).
* Redesigned [`InferenceSession.load()`](/max/api/mojo/engine/session/InferenceSession#load) to replace the confusing `config` argument with a `custom_ops_path` argument for use when [loading a custom op](/max/extensibility/), and an `input_specs` argument for use when loading TorchScript models. Doing so removed `LoadOptions` and introduced the new [`InputSpec`](/max/api/mojo/engine/session/InputSpec) type to define the input shape/type of a model (instead of `LoadOptions`).
* New [`ShapeElement`](/max/api/mojo/engine/shape_element/ShapeElement) type to allow for named dynamic dimensions (in `InputSpec`).
* The `max.engine.engine` module was renamed to [`max.engine.info`](/max/api/mojo/engine/info/).

#### MAX Engine C API

* [`M_newTorchInputSpec()`](/max/api/c/pytorch/config#m_newtorchinputspec) now supports named dynamic dimensions (via the new `dimNames` argument).

### ❌ Removed

* Removed TensorFlow support in the MAX SDK, so you can no longer load a TensorFlow SavedModel for inference. However, TensorFlow is still available for enterprise customers. We removed TensorFlow because industry-wide TensorFlow usage has declined significantly, especially for the latest AI innovations. Removing TensorFlow also cuts our package size by over 50% and accelerates the development of other customer-requested features. If you have a production use-case for a TensorFlow model, please [contact us](https://www.modular.com/company/contact).
* Removed the Python `CommonLoadOptions`, `TorchLoadOptions`, and `TensorFlowLoadOptions` classes. See the note above about `InferenceSession.load()` changes.
* Removed the Mojo `LoadOptions` type. See the note above about `InferenceSession.load()` changes.

## v24.2.1 (2024-04-11)

* You can now import more MAX Graph functions from `max.graph.ops` instead of using `max.graph.ops.elementwise`. For example:

```mojo
from max.graph import ops

var relu = ops.relu(matmul)
```

## v24.2 (2024-03-28)

* MAX Engine now supports TorchScript models with dynamic input shapes. No matter what the input shapes are, you still need to [specify the input specs](/max/model-formats#specify-torchscript-input-specs) for all TorchScript models.
* The Mojo standard library is now open source! Read more about it in [this blog post](https://www.modular.com/blog/the-next-big-step-in-mojo-open-source).
* And, of course, lots of Mojo updates, including implicit traits, support for keyword arguments in Python calls, a new `List` type (previously `DynamicVector`), some refactoring that might break your code, and much more. For details, see the [Mojo changelog](/mojo/changelog#v242-2024-03-28).

## v24.1.1 (2024-03-18)

This is a minor release that improves error reports.

## v24.1 (2024-02-29)

The first release of the MAX platform is here! 🚀

This is a **preview version** of the MAX platform. That means it is not ready for production deployment and is designed only for local development and evaluation. Because this is a preview, some API libraries are still in development and subject to change, and some features that we previously announced are not quite ready yet. But there is a lot that you can do in this release!

This release includes our flagship developer tools, currently for **Linux only**:

* **MAX Engine**: Our state-of-the-art graph compiler and runtime library that executes models from PyTorch and ONNX, with incredible inference speed on a wide range of hardware.
  * API libraries in Python, C, and Mojo to run inference with your existing models. [See the API references](/max/api).
  * The `max benchmark` tool, which runs MLPerf benchmarks on any compatible model without writing any code.
  * The `max visualize` tool, which allows you to visualize your model in Netron after partially lowering in MAX Engine.
  * An early look at the [MAX Graph API](/max/model-formats#max-graph), our low-level library for building high-performance inference graphs.
* **MAX Serving**: A preview of our serving wrapper for MAX Engine that provides full interoperability with existing AI serving systems (such as Triton) and that seamlessly deploys within existing container infrastructure (such as Kubernetes).
  * A Docker image that runs MAX Engine as a backend for NVIDIA Triton Inference Server.
* **Mojo**: The world's first programming language built from the ground up for AI developers, with cutting-edge compiler technology that delivers unparalleled performance and programmability for any hardware.
  * The latest version of Mojo, the standard library, and the `mojo` command line tool. These are always included in MAX, so you don't need to download any separate packages.
  * The Mojo changes in each release are often quite long, so we're going to continue sharing those in the existing [Mojo changelog](/mojo/changelog).

Additionally, we've started a new [GitHub repo for MAX](https://github.com/modular/max), where we currently share a bunch of code examples for our API libraries, including some large model pipelines.
You can also use this repo to [report issues with MAX](https://github.com/modular/modular/issues/new/choose).

### Model Architecture Support

* Added support for the following model architectures:
  * `OlmoForCausalLM` (such as `allenai/OLMo-1B-0724-hf`)
  * `GraniteForCausalLM` (such as `ibm-granite/granite-3.1-8b-instruct`)
  * `Phi3ForCausalLM` (for Microsoft Phi-3 models)
  * `Qwen2ForCausalLM` (for Qwen2 models)

  Example usage:

  ```sh
  max-pipelines generate \
    --model-path allenai/OLMo-1B-0724-hf \
    --prompt "Write bubble sort in mojo"
  ```

* The `max.pipelines.dataprocessing.tokenizer` and `max.pipelines.dataprocessing.gguf_utils` modules have been removed.
* The previously deprecated `PipelineConfig.architecture` field and its corresponding `--architecture` CLI argument have been removed.

### `max-pipelines` CLI

* The `--devices` CLI argument now supports a comma-separated list of GPU IDs prefixed with `gpu:`, like `--devices=gpu:0,1,2,3`. We no longer support the previous `--devices=gpu-` format.

  ```sh
  max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
    --quantization-encoding bfloat16 \
    --devices gpu:0,1,2,3 \
    --prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
  ```

* Removed the `--huggingface-repo-id` PipelineConfig option and CLI argument in favor of `--model-path`.
* Consolidated `--model-path` and `--weight-path`. If valid `--weight-path`(s) are provided, they'll now override `--model-path`, which in turn handles both local and remote (Hugging Face) cases. If we cannot derive the weights from the `--weight-path`(s), we'll now fall back to the `--model-path`, which has to be set explicitly by the user.
* Added the `--huggingface-revision` option to allow selecting a non-default branch or a specific commit in a Hugging Face model repository.

---

## Why Mojo🔥

When we started Modular, we had no intention of building a new programming language. But as we were building our [platform to unify the world's ML/AI infrastructure](https://www.modular.com/blog/the-case-for-a-next-generation-ai-developer-platform), we realized that programming across the entire stack was too complicated. Plus, we were writing a lot of MLIR by hand and not having a good time.

What we wanted was an innovative and scalable programming model that could target accelerators and other heterogeneous systems that are pervasive in the AI field. This meant a programming language with powerful compile-time metaprogramming, integration of adaptive compilation techniques, caching throughout the compilation flow, and other features that are not supported by existing languages.

And although accelerators are important, one of the most prevalent and sometimes overlooked "accelerators" is the host CPU. Nowadays, CPUs have lots of tensor-core-like accelerator blocks and other AI acceleration units, but they also serve as the "fallback" for operations that specialized accelerators don't handle, such as data loading, pre- and post-processing, and integrations with foreign systems. So it was clear that we couldn't lift AI with just an "accelerator language" that worked with only specific processors.

Applied AI systems need to address all these issues, and we decided there was no reason it couldn't be done with just one language. Thus, Mojo was born.
## A language for next-generation compiler technology {#mlir}

When we realized that no existing language could solve the challenges in AI compute, we embarked on a first-principles rethinking of how a programming language should be designed and implemented to solve our problems. Because we require high-performance support for a wide variety of accelerators, traditional compiler technologies like LLVM and GCC were not suitable (and any languages and tools based on them would not suffice). Although they support a wide range of CPUs and some commonly used GPUs, these compiler technologies were designed decades ago and are unable to fully support modern chip architectures. Nowadays, the standard technology for specialized machine learning accelerators is MLIR.

[MLIR](https://mlir.llvm.org/) is a relatively new open-source compiler infrastructure started at Google (whose leads moved to Modular) that has been widely adopted across the machine learning accelerator community. MLIR's strength is its ability to build *domain-specific* compilers, particularly for weird domains that aren't traditional CPUs and GPUs, such as AI ASICs, [quantum computing systems](https://github.com/PennyLaneAI/catalyst), FPGAs, and [custom silicon](https://circt.llvm.org/).

Given our goals at Modular to build a next-generation AI platform, we were already using MLIR for some of our infrastructure, but we didn't have a programming language that could unlock MLIR's full potential across our stack. While many other projects now use MLIR, Mojo is the first major language designed expressly *for MLIR*, which makes Mojo uniquely powerful when writing systems-level code for AI workloads.

## A member of the Python family

Our core mission for Mojo includes innovations in compiler internals and support for current and emerging accelerators, but we don't see any need to innovate in language *syntax* or *community*. So we chose to embrace the Python ecosystem because it is so widely used, it is loved by the AI ecosystem, and because we believe it is a really nice language.

The Mojo language has lofty goals: we want full compatibility with the Python ecosystem, we want predictable low-level performance and low-level control, and we need the ability to deploy subsets of code to accelerators. Additionally, we don't want to create a fragmented software ecosystem—we don't want Python users who adopt Mojo to draw comparisons to the painful migration from Python 2 to 3. These are no small goals!

Fortunately, while Mojo is a brand-new code base, we aren't really starting from scratch conceptually. Embracing Python massively simplifies our design efforts, because most of the syntax is already specified. We can instead focus our efforts on building Mojo's compilation model and systems programming features. We also benefit from tremendous lessons learned from other languages (such as Rust, Swift, Julia, Zig, Nim, etc.), from our prior experience migrating developers to new compilers and languages, and we leverage the existing MLIR compiler ecosystem.

Further, we decided that the right *long-term goal* for Mojo is to adopt the **syntax of Python** (that is, to make Mojo compatible with existing Python programs) and to embrace the CPython implementation for long-tail ecosystem support. If you're a Python programmer, we hope that Mojo is immediately familiar, while also providing new tools to develop safe and performant systems-level code that would otherwise require C and C++ below Python.
We aren't trying to convince the world that "static is best" or "dynamic is best." Rather, we believe that both are good when used for the right applications, so we designed Mojo to allow you, the programmer, to decide when to use static or dynamic.

### Why we chose Python

Python is the dominant force in ML and countless other fields. It's easy to learn, known by important cohorts of programmers, has an amazing community, has tons of valuable packages, and has a wide variety of good tooling. Python supports the development of beautiful and expressive APIs through its dynamic programming features, which led machine learning frameworks like TensorFlow and PyTorch to embrace Python as a frontend to their high-performance runtimes implemented in C++.

For Modular today, Python is a non-negotiable part of our API surface stack—this is dictated by our customers. Given that everything else in our stack is negotiable, it stands to reason that we should start from a "Python-first" approach.

More subjectively, we believe that Python is a beautiful language. It's designed with simple and composable abstractions, it eschews needless punctuation that is redundant-in-practice with indentation, and it's built with powerful (dynamic) metaprogramming features, all of which provide a runway for us to extend the language to what we need at Modular. We hope that people in the Python ecosystem see our direction for Mojo as taking Python ahead to the next level—completing it—instead of competing with it.

## Compatibility with Python

We plan for full compatibility with the Python ecosystem, but there are actually two types of compatibility, so here's where we currently stand on them both:

* In terms of your ability to import existing Python modules and use them in a Mojo program, Mojo is 100% compatible because we use CPython for interoperability.
* In terms of your ability to migrate any Python code to Mojo, it's not fully compatible yet. Mojo already supports many core features from Python, including async/await, error handling, variadics, and so on. However, Mojo is still young and missing many other features from Python. Mojo doesn't even support classes yet!

There is a lot of work to be done, but we're confident we'll get there, and we're guided by our team's experience building other major technologies with their own compatibility journeys:

* The journey to the [Clang compiler](https://clang.llvm.org/) (a compiler for C, C++, Objective-C, CUDA, OpenCL, and others), which is a "compatible replacement" for GCC, MSVC, and other existing compilers. It is hard to make a direct comparison, but the complexity of the Clang problem appears to be an order of magnitude bigger than implementing a compatible replacement for Python.
* The journey to the [Swift programming language](https://www.swift.org/), which embraced the Objective-C runtime and language ecosystem, and progressively migrated millions of programmers (and huge amounts of code). With Swift, we learned lessons about how to be "run-time compatible" and cooperate with a legacy runtime.

In situations where you want to mix Python and Mojo code, we expect Mojo to cooperate directly with the CPython runtime and have similar support for integrating with CPython classes and objects without having to compile the code itself. This provides plug-in compatibility with a massive ecosystem of existing code, and it enables a progressive migration approach in which incremental migration to Mojo yields incremental benefits.
Overall, we believe that by focusing on language design and incremental progress towards full compatibility with Python, we will get where we need to be in time.

However, it's important to understand that when you write pure Mojo code, there is nothing in the implementation, compilation, or runtime that uses any existing Python technologies. On its own, it is an entirely new language with an entirely new compilation and runtime system.

### Intentional differences from Python

While Python compatibility and migratability are key to Mojo's success, we also want Mojo to be a first-class language (meaning that it's a standalone language rather than dependent upon another language). It should not be limited in its ability to introduce new keywords or grammar productions merely to maintain compatibility. As such, our approach to compatibility is two-fold:

1. We utilize CPython to run all existing Python 3 code without modification and use its runtime, unmodified, for full compatibility with the entire ecosystem. Running code this way provides no benefit from Mojo, but the sheer existence and availability of this ecosystem will rapidly accelerate the bring-up of Mojo, and it leverages the fact that Python is really great for high-level programming already.
2. We will provide a mechanical migration tool that provides very good compatibility for people who want to migrate code from Python to Mojo. For example, to avoid migration errors with Python code that uses identifier names that match Mojo keywords, Mojo provides a backtick feature that allows any keyword to behave as an identifier.

Together, this allows Mojo to integrate well in a mostly-CPython world, but allows Mojo programmers to progressively move code (a module or file at a time) to Mojo. This is a proven approach from the Objective-C to Swift migration that Apple performed.

It will take some time to build the rest of Mojo and the migration support, but we are confident that this strategy allows us to focus our energies and avoid distractions. We also think the relationship with CPython can build in both directions—wouldn't it be cool if the CPython team eventually reimplemented the interpreter in Mojo instead of C? 🔥

## Python's problems

By aiming to make Mojo the best way to extend Python, we believe we can solve many of Python's existing problems.

Python has some well-known problems—most obviously, poor low-level performance and CPython implementation details like the global interpreter lock (GIL), which makes Python single-threaded. While there are many active projects underway to address these challenges, the issues brought by Python go deeper and are particularly impactful in the AI field. Instead of talking about those technical limitations in detail, we'll talk about their implications here in the present.

Note that every reference to Python in this section refers to the CPython implementation. We'll talk about other implementations later.

### The two-world problem

For a variety of reasons, Python isn't suitable for systems programming. Fortunately, Python has amazing strengths as a glue layer, and low-level bindings to C and C++ allow building libraries in C, C++, and many other languages with better performance characteristics. This is what has enabled things like NumPy, TensorFlow, PyTorch, and a vast number of other libraries in the ecosystem.

Unfortunately, while this approach is an effective way to build high-performance Python libraries, it comes with a cost: building these hybrid libraries is very complicated.
It requires low-level understanding of the internals of CPython, requires knowledge of C/C++ (or other) programming (undermining one of the original goals of using Python in the first place), makes it difficult to evolve large frameworks, and (in the case of ML) pushes the world towards "graph based" programming models, which have worse fundamental usability than "eager mode" systems. Both TensorFlow and PyTorch have faced significant challenges in this regard.

Beyond the fundamental nature of how the two-world problem creates system complexity, it makes everything else in the ecosystem more complicated. Debuggers generally can't step across Python and C code, and those that can aren't widely accepted. It's painful that the Python package ecosystem has to deal with C/C++ code in addition to Python. Projects like PyTorch, with significant C++ investments, are intentionally trying to move more of their codebase to Python because they know it gains usability.

### The three-world and N-world problem

The two-world problem is commonly felt across the Python ecosystem, but things are even worse for developers of machine learning frameworks. AI is pervasively accelerated, and those accelerators use bespoke programming languages like CUDA. While CUDA is a relative of C++, it has its own special problems and limitations, and it does not have consistent tools like debuggers or profilers. It is also effectively locked into a single hardware maker.

The AI world has an incredible amount of innovation on the hardware front, and as a consequence, complexity is spiraling out of control. There are now several attempts to build limited programming systems for accelerators (OpenCL, SYCL, oneAPI, and others). This complexity continues to increase, and none of these systems solve the fundamental fragmentation in the tools and ecosystem that is hurting the industry so badly—they're *adding to the fragmentation*.

### Mobile and server deployment

Another challenge for the Python ecosystem is deployment. There are many facets to this, including how to control dependencies, how to deploy hermetically compiled "a.out" files, and how to improve multi-threading and performance. These are areas where we would like to see the Python ecosystem take significant steps forward.

## Related work

We are aware of many other efforts to improve Python, but they do not solve the [fundamental problem](#mlir) we aim to solve with Mojo. Some ongoing efforts to improve Python include work to speed up Python and replace the GIL, to build languages that look like Python but are subsets of it, and to build embedded domain-specific languages (DSLs) that integrate with Python but which are not first-class languages. While we cannot provide an exhaustive list of all the efforts, we can talk about some challenges faced in these projects, and why they don't solve the problems that Mojo does.

### Improving CPython and JIT compiling Python

Recently, the community has spent significant energy on improving CPython performance and other implementation issues, and this is showing huge results. This work is fantastic because it incrementally improves the current CPython implementation. For example, Python 3.11 has increased performance 10-60% over Python 3.10 through internal improvements, and [Python 3.12](https://github.com/faster-cpython/ideas/wiki/Python-3.12-Goals) aims to go further with a trace optimizer.
[Python 3.13](https://github.com/faster-cpython/ideas/blob/main/3.13/README.md) adds a [JIT compiler](https://peps.python.org/pep-0744/) to CPython, enables the use of [multiple subinterpreters](https://peps.python.org/pep-0554/) in a single Python process (thus sidestepping the GIL), and speeds up memory management. Many other projects are attempting to tame the GIL, and projects like PyPy (among many others) have used JIT compilation and tracing approaches to speed up Python.

While we are fans of these great efforts, and feel they are valuable and exciting to the community, they unfortunately do not satisfy our needs at Modular, because they do not help provide a unified language that extends onto accelerators. Many accelerators these days support very limited dynamic features, or do so with terrible performance. Furthermore, systems programmers don't seek only "performance," but they also typically want a lot of **predictability and control** over how a computation happens. We are looking to eliminate the need to use C or C++ within Python libraries, we seek the highest performance possible, and we cannot accept dynamic features at all in some cases. Therefore, these approaches don't help.

### Python subsets and other Python-like languages

There are many attempts to build a "deployable" Python, such as TorchScript from the PyTorch project. These are useful because they often provide low-dependency deployment solutions and sometimes have high performance. Because they use Python-like syntax, they can be easier to learn than a novel language.

On the other hand, these languages have not seen wide adoption—because they are a subset of Python, they generally don't interoperate with the Python ecosystem, don't have fantastic tooling (such as debuggers), and often unilaterally change inconvenient Python behaviors, which breaks compatibility and fragments the ecosystem further. For example, many of these change the behavior of simple integers to wrap instead of producing Python-compatible math.

The challenge with these approaches is that they attempt to solve a weak point of Python, but they aren't as good at Python's strong points. At best, these can provide a new alternative to C and C++, but without solving the dynamic use-cases of Python, they cannot solve the "two world problem." This approach drives fragmentation, and incompatibility makes *migration* difficult to impossible—recall how challenging it was to migrate from Python 2 to Python 3.

### Python family languages with C compatibility

Because Mojo is designed to adopt the syntax of Python with improved systems programming capabilities, it shares some high-level ideas with other members of the Python family of languages like [Pyrex](https://wiki.python.org/moin/Pyrex) and [Cython](https://cython.org/). Like Mojo, these projects define their own language while also supporting the Python language. They allow you to write more performant extensions for Python that interoperate with both Python and C libraries.

These Python family languages are great for some kinds of applications, and they've been applied to great effect by some popular Python libraries. However, they don't solve [Python's two-world problem](#the-two-world-problem), and because they rely on CPython for their core semantics, they can't work without it, whereas Mojo uses CPython only when necessary to provide [compatibility with existing Python code](#compatibility-with-python).
Pure Mojo code does not use any pre-existing runtime or compiler technologies; instead, it uses an [MLIR-based infrastructure](#mlir) to enable high-performance execution on a wide range of hardware.

### Embedded DSLs in Python

Another common approach is to build embedded domain-specific languages (DSLs) in Python, typically implemented with a Python decorator. There are many examples of this (the `@tf.function` decorator in TensorFlow, the `@triton.jit` decorator in OpenAI's Triton programming model, etc.). A major benefit of these systems is that they maintain compatibility with the Python ecosystem of tools, and integrate natively into Python logic, allowing an embedded mini language to co-exist with the strengths of Python for dynamic use cases.

Unfortunately, the embedded mini-languages provided by these systems often have surprising limitations, don't integrate well with debuggers and other workflow tooling, and do not support the level of native language integration that we seek for a language that unifies heterogeneous compute and is the primary way to write large-scale kernels and systems. With Mojo, we hope to move the usability of the overall system forward by simplifying things and making it more consistent. Embedded DSLs are an expedient way to get demos up and running, but we are willing to put in the additional effort and work to provide better usability and predictability for our use-case.

To learn about what we've built with Mojo so far, see the [Mojo Manual](/mojo/manual/).

---

## WorkInfo

`@register_passable(trivial)` `struct WorkInfo`

## Fields

* m (`SIMD[uint32, 1]`)
* n (`SIMD[uint32, 1]`)
* k\_start (`SIMD[uint32, 1]`)
* num\_k\_tiles (`SIMD[uint32, 1]`)
* is\_valid\_tile (`Bool`)

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `is_valid`

`is_valid(self) -> Bool`

### `__str__`

`__str__(self) -> String`

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

---

## WorkInfo

`@register_passable(trivial)` `struct WorkInfo`

## Fields

* prompt\_offset (`SIMD[uint32, 1]`)
* head\_idx (`SIMD[uint32, 1]`)
* prompt\_idx (`SIMD[uint32, 1]`)
* is\_valid\_tile (`Bool`)

## Implemented traits

`AnyType`, `Copyable`, `ExplicitlyCopyable`, `Movable`, `Stringable`, `UnknownDestructibility`, `Writable`

## Methods

### `is_valid`

`is_valid(self) -> Bool`

### `__str__`

`__str__(self) -> String`

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

---

## Writable

The `Writable` trait describes how a type is written into a `Writer`. You must implement `write_to`, which takes `self` and a type conforming to `Writer`:

```mojo
struct Point(Writable):
    var x: Float64
    var y: Float64

    fn write_to[W: Writer](self, mut writer: W):
        var string = "Point"
        # Write a single `Span[Byte]`:
        writer.write_bytes(string.as_bytes())
        # Pass multiple args that can be converted to a `Span[Byte]`:
        writer.write("(", self.x, ", ", self.y, ")")
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `write_to`

`write_to[W: Writer](self: _Self, mut writer: W)`

Formats the string representation of this type to the provided Writer.

**Parameters:**

* W (`Writer`): A type conforming to the `Writer` trait.

**Args:**

* writer (`W`): The `Writer` to write to.
---

## WritableVariadicPack

`@register_passable`

`struct WritableVariadicPack[mut: Bool, //, is_owned: Bool, origin: Origin[mut], pack_origin: Origin[mut], *Ts: Writable]`

Wraps a `VariadicPack`, enabling it to be passed to a writer along with extra arguments.

Example:

```mojo
from utils.write import WritableVariadicPack

fn foo[*Ts: Writable](*messages: *Ts):
    print("message:", WritableVariadicPack(messages), "[end]")

fn main():
    var x = 42
    foo("'x = ", x, "'")
```

Output:

```text
message: 'x = 42' [end]
```

## Parameters

* mut (`Bool`): Whether the origin is mutable.
* is\_owned (`Bool`): Whether the `VariadicPack` owns its elements.
* origin (`Origin[mut]`): The origin of the reference to the `VariadicPack`.
* pack\_origin (`Origin[mut]`): The origin of the `VariadicPack`.
* \*Ts (`Writable`): The types of the variadic arguments conforming to `Writable`.

## Fields

* value (`Pointer[VariadicPack[is_owned, pack_origin, Writable, Ts], origin]`): Reference to a `VariadicPack` that conforms to `Writable`.

## Implemented traits

`AnyType`, `UnknownDestructibility`, `Writable`

## Methods

### `__init__`

`__init__(ref [origin] value: VariadicPack[is_owned, pack_origin, Writable, Ts]) -> Self`

Initialize using a reference to the `VariadicPack`.

**Args:**

* value (`VariadicPack[is_owned, pack_origin, Writable, Ts]`): The `VariadicPack` to take a reference to.

### `write_to`

`write_to[W: Writer](self, mut writer: W)`

Formats the string representation of all the arguments in the `VariadicPack` to the provided `Writer`.

**Parameters:**

* W (`Writer`): A type conforming to the `Writer` trait.

**Args:**

* writer (`W`): The writer to write to.

---

## write

Establishes the contract between `Writer` and `Writable` types.

## Structs

* [`WritableVariadicPack`](/mojo/stdlib/utils/write/WritableVariadicPack): Wraps a `VariadicPack`, enabling it to be passed to a writer along with extra arguments.

## Traits

* [`Writable`](/mojo/stdlib/utils/write/Writable): The `Writable` trait describes how a type is written into a `Writer`.
* [`Writer`](/mojo/stdlib/utils/write/Writer): Describes a type that can be written to by any type that implements the `write_to` function.

## Functions

* [`write_args`](/mojo/stdlib/utils/write/write_args): Add separators and end characters when writing variadics into a `Writer`.
* [`write_buffered`](/mojo/stdlib/utils/write/write_buffered): Use a buffer on the stack to minimize expensive calls to the writer. When the buffer would overflow, it writes to the `writer` passed in. You can also add separators between the args, and end characters. The default stack space used for the buffer is 4096 bytes, which matches the default arm64 and x86-64 page size; you can modify this, for example, when writing a large amount of data to a file.

---

## write_args

`write_args[W: Writer, *Ts: Writable](mut writer: W, args: VariadicPack[is_owned, origin, Writable, Ts], *, sep: StringSlice[StaticConstantOrigin] = StringSlice(), end: StringSlice[StaticConstantOrigin] = StringSlice())`

Add separators and end characters when writing variadics into a `Writer`.

Example:

```mojo
import sys
from utils import write_args

fn variadic_pack_function[*Ts: Writable](
    *args: *Ts, sep: StaticString, end: StaticString
):
    var stdout = sys.stdout
    write_args(stdout, args, sep=sep, end=end)

fn main():
    variadic_pack_function(3, "total", "args", sep=", ", end="[end]")
```

```text
3, total, args[end]
```

**Parameters:**

* W (`Writer`): The type of the `Writer` to write to.
* \*Ts (`Writable`): The types of each arg to write. Each type must satisfy `Writable`.
**Args:**

* writer (`W`): The `Writer` to write to.
* args (`VariadicPack[is_owned, origin, Writable, Ts]`): A `VariadicPack` of `Writable` arguments.
* sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements.
* end (`StringSlice[StaticConstantOrigin]`): The string to write after printing the elements.

---

## write_buffered

`write_buffered[W: Writer, //, *Ts: Writable, *, buffer_size: Int = 4096, use_heap: Bool = False](mut writer: W, args: VariadicPack[is_owned, origin, Writable, Ts], *, sep: StringSlice[StaticConstantOrigin] = StringSlice(), end: StringSlice[StaticConstantOrigin] = StringSlice())`

Use a buffer on the stack to minimize expensive calls to the writer. When the buffer would overflow, it writes to the `writer` passed in. You can also add separators between the args, and end characters. The default stack space used for the buffer is 4096 bytes, which matches the default arm64 and x86-64 page size; you can modify this, for example, when writing a large amount of data to a file.

Example:

```mojo
import sys
from utils import write_buffered

fn print_err_buffered[*Ts: Writable](
    *args: *Ts, sep: StaticString, end: StaticString
):
    var stderr = sys.stderr
    write_buffered(stderr, args, sep=sep, end=end)

    # Buffer before allocating a string
    var string = String()
    write_buffered(string, args, sep=sep, end=end)

fn main():
    print_err_buffered(3, "total", "args", sep=", ", end="[end]")
```

```text
3, total, args[end]
```

**Parameters:**

* W (`Writer`): The type of the `Writer` to write to.
* \*Ts (`Writable`): The types of each arg to write. Each type must satisfy `Writable`.
* buffer\_size (`Int`): How many bytes to write to a buffer before writing out to the `writer` (default `4096`).
* use\_heap (`Bool`): Buffer to the heap, first calculating the total byte size of all the args and then allocating only once. `buffer_size` is not used in this case, as the size is calculated dynamically (default `False`).

**Args:**

* writer (`W`): The `Writer` to write to.
* args (`VariadicPack[is_owned, origin, Writable, Ts]`): A `VariadicPack` of `Writable` arguments.
* sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements.
* end (`StringSlice[StaticConstantOrigin]`): The string to write after printing the elements.

`write_buffered[W: Writer, T: Copyable & Movable & Writable, //, buffer_size: Int = 4096](mut writer: W, values: List[T, hint_trivial_type], *, sep: StringSlice[StaticConstantOrigin] = StringSlice())`

Use a buffer on the stack to minimize expensive calls to the writer. You can also add separators between the values. The default stack space used for the buffer is 4096 bytes, which matches the default arm64 and x86-64 page size; you can modify this, for example, when writing a large amount of data to a file.

Example:

```mojo
from utils import write_buffered

fn main():
    var string = String()
    var values = List[String]("3", "total", "args")
    write_buffered(string, values, sep=", ")
    print(string)
```

```text
3, total, args
```

**Parameters:**

* W (`Writer`): The type of the `Writer` to write to.
* T (`Copyable & Movable & Writable`): The element type of the `List`. Must implement the `Writable`, `Copyable` and `Movable` traits.
* buffer\_size (`Int`): How many bytes to write to a buffer before writing out to the `writer` (default `4096`).

**Args:**

* writer (`W`): The `Writer` to write to.
* values (`List[T, hint_trivial_type]`): A `List` of `Writable` arguments.
* sep (`StringSlice[StaticConstantOrigin]`): The separator used between elements.
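The `buffer_size` and `use_heap` parameters can be tuned per call site. The following sketch is illustrative only: the `log_line` helper and the 16 KiB buffer size are assumptions for the example, not part of the API.

```mojo
import sys
from utils import write_buffered

fn log_line[*Ts: Writable](*args: *Ts):
    var stdout = sys.stdout
    # Assumption: a larger stack buffer (16 KiB here) may reduce the
    # number of writer calls when emitting large records; the default
    # is 4096 bytes.
    write_buffered[buffer_size=16384](stdout, args, sep=" ", end="\n")

fn main():
    log_line("tokens:", 128, "latency_ms:", 3.2)
```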
---

## Writer

Describes a type that can be written to by any type that implements the `write_to` function. This enables you to write one `write_to` implementation that can target a variety of destinations, such as file descriptors, strings, and network locations. The types are written as a `Span[Byte]`, so the `Writer` can avoid allocations depending on the requirements. There is also a general `write` method that takes multiple arguments that implement `write_to`.

Example:

```mojo
from memory import Span

@value
struct NewString(Writer, Writable):
    var s: String

    # Writer requirement to write a Span of Bytes
    fn write_bytes(mut self, bytes: Span[Byte, _]):
        self.s._iadd(bytes)

    # Writer requirement to take multiple args
    fn write[*Ts: Writable](mut self, *args: *Ts):
        @parameter
        fn write_arg[T: Writable](arg: T):
            arg.write_to(self)

        args.each[write_arg]()

    # Also make it Writable to allow `print` to write the inner String
    fn write_to[W: Writer](self, mut writer: W):
        writer.write(self.s)


@value
struct Point(Writable):
    var x: Int
    var y: Int

    # Pass multiple args to the Writer. The Int and StaticString types
    # call `writer.write_bytes` in their own `write_to` implementations.
    fn write_to[W: Writer](self, mut writer: W):
        writer.write("Point(", self.x, ", ", self.y, ")")

    # Enable conversion to a String using `String(point)`
    fn __str__(self) -> String:
        return String.write(self)


fn main():
    var point = Point(1, 2)
    var new_string = NewString(String(point))
    new_string.write("\n", Point(3, 4))
    print(new_string)
```

Output:

```plaintext
Point(1, 2)
Point(3, 4)
```

## Implemented traits

`AnyType`, `UnknownDestructibility`

## Methods

### `write_bytes`

`write_bytes(mut self: _Self, bytes: Span[SIMD[uint8, 1], origin])`

Write a `Span[Byte]` to this `Writer`.

**Args:**

* bytes (`Span[SIMD[uint8, 1], origin]`): The span of bytes to write to this `Writer`. Must NOT be null-terminated.

### `write`

`write[*Ts: Writable](mut self: _Self, *args: *Ts)`

Write a sequence of `Writable` arguments to this `Writer`.

**Parameters:**

* \*Ts (`Writable`): Types of the provided argument sequence.

**Args:**

* \*args (`*Ts`): Sequence of arguments to write to this `Writer`.

---

## y0

`y0[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the Bessel function of the second kind of order 0 for each input value.

**Constraints:**

The input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): The input vector.

**Returns:**

A vector containing the computed value for each value in the input.

---

## y1

`y1[dtype: DType, width: Int, //](x: SIMD[dtype, width]) -> SIMD[dtype, width]`

Computes the Bessel function of the second kind of order 1 for each input value.

**Constraints:**

The input must be a floating-point type.

**Parameters:**

* dtype (`DType`): The `dtype` of the input and output SIMD vector.
* width (`Int`): The width of the input and output SIMD vector.

**Args:**

* x (`SIMD[dtype, width]`): The input vector.

**Returns:**

A vector containing the computed value for each value in the input.

---

## zip

`zip[origin: ImmutableOrigin, n: Int](ts: InlineArray[Pointer[IntTuple, origin], n]) -> _zip[origin, n]`

Create a zip iterator from an array of `IntTuple` pointers. This function creates a zip iterator that allows simultaneous traversal of multiple `IntTuple` collections.
**Parameters:**

* origin (`ImmutableOrigin`): The origin tracking parameter for memory safety.
* n (`Int`): The number of `IntTuple` collections being zipped together.

**Args:**

* ts (`InlineArray[Pointer[IntTuple, origin], n]`): Array of pointers to the `IntTuple` collections to zip.

**Returns:**

A `_zip` object that can be iterated over.

`zip(a: IntTuple[origin], b: IntTuple[origin], out result: _zip[{a, b}, 2])`

Create a zip iterator for two `IntTuple`s. This function creates a zip iterator that allows simultaneous traversal of two `IntTuple`s, yielding pairs of corresponding elements.

**Args:**

* a (`IntTuple[origin]`): First `IntTuple` to zip.
* b (`IntTuple[origin]`): Second `IntTuple` to zip.

**Returns:**

The resulting zip iterator for the input `IntTuple`s.

`zip(a: IntTuple[origin], b: IntTuple[origin], c: IntTuple[origin], out result: _zip[{a, b, c}, 3])`

Create a zip iterator for three `IntTuple`s. This function creates a zip iterator that allows simultaneous traversal of three `IntTuple`s, yielding triplets of corresponding elements.

**Args:**

* a (`IntTuple[origin]`): First `IntTuple` to zip.
* b (`IntTuple[origin]`): Second `IntTuple` to zip.
* c (`IntTuple[origin]`): Third `IntTuple` to zip.

**Returns:**

The resulting zip iterator for the input `IntTuple`s.

---

## zip_modes

`zip_modes(layout_a: Layout, layout_b: Layout) -> Layout`

Combines corresponding modes from two layouts. This function creates a new layout by combining corresponding dimensions from two layouts. If a dimension in `layout_b` has a non-positive shape, the corresponding dimension from `layout_a` is used directly.

**Args:**

* layout\_a (`Layout`): The first layout.
* layout\_b (`Layout`): The second layout.

**Returns:**

A new layout with combined dimensions from both input layouts.

---

## zipped_divide

`zipped_divide(layout_a: Layout, layout_b: Layout) -> Layout`

Divides a layout into blocks according to another layout. This function creates a hierarchical layout by dividing the first layout according to the second layout. It's an alias for `hierarchical_unzip` that provides a more intuitive name for the division operation. This is useful for creating blocked or tiled representations of tensors.

Example:

```mojo
from layout import Layout, IntTuple
from layout.layout import zipped_divide

fn main():
    # Create layouts
    var base = Layout.row_major(6, 8)
    var pattern = Layout(IntTuple(2, 2))
    var result = zipped_divide(base, pattern)
```

**Args:**

* layout\_a (`Layout`): The layout to be divided.
* layout\_b (`Layout`): The layout defining the division pattern.

**Returns:**

A new layout representing the hierarchical division of `layout_a` according to `layout_b`.

`zipped_divide(layout_a: Layout, tiler: List[Layout]) -> Layout`

Divides a layout into blocks according to a list of layouts. This function creates a hierarchical layout by dividing the first layout according to the layouts in the tiler list. It's an alias for `hierarchical_unzip` that provides a more intuitive name for the division operation when working with multiple tiling patterns.

Example:

```mojo
from layout import Layout, LayoutList, IntTuple
from layout.layout import zipped_divide

fn main():
    # Create layouts
    var base = Layout.row_major(6, 8)
    var tilers = LayoutList()
    tilers.append(Layout(IntTuple(2, 2)))
    var result = zipped_divide(base, tilers)
```

**Args:**

* layout\_a (`Layout`): The layout to be divided.
* tiler (`List[Layout]`): A list of layouts defining the division patterns.
**Returns:**

A new layout representing the hierarchical division of `layout_a` according to the patterns in `tiler`.
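As a rough sketch of what the division produces, consider dividing a `(6, 8)` layout by a `(2, 2)` tile. The grouped shape shown in the comment is an assumption based on the analogous CuTe `zipped_divide` operation, which this function mirrors; treat it as illustrative rather than verified output.

```mojo
from layout import Layout, IntTuple
from layout.layout import zipped_divide

fn main():
    # Assumption: the tile modes are grouped first, followed by the
    # remaining "rest" modes, i.e. a shape like ((2, 2), (3, 4)):
    # one mode pair per tile, one pair for the grid of tiles.
    var result = zipped_divide(Layout.row_major(6, 8), Layout(IntTuple(2, 2)))
    print(result)
```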